Machine Learning: End-to-End guide for Java developers

By: Boštjan Kaluža, Jennifer L. Reese, Krishna Choppella, Richard M. Reese, Uday Kamath

Overview of this book

Machine learning is one of the core areas of artificial intelligence, in which computers are trained to learn, adapt, and improve on their own without being explicitly programmed. In this course, we cover how Java is employed to build powerful machine learning models that address problems faced in the world of data science. The course demonstrates complex data extraction and statistical analysis techniques supported by Java, applies various machine learning methods, explores machine learning sub-domains, and examines real-world use cases such as recommendation systems, fraud detection, natural language processing, and more, using Java programming.

The course begins with an introduction to data science and basic data science tasks such as data collection, data cleaning, data analysis, and data visualization. The next section has a detailed overview of statistical techniques, covering machine learning, neural networks, and deep learning. The next couple of sections cover applying machine learning methods using Java to a variety of tasks including classifying, predicting, forecasting, market basket analysis, clustering, stream learning, active learning, semi-supervised learning, probabilistic graph modeling, text mining, and deep learning. The last section highlights real-world test cases such as performing activity recognition, developing image recognition, text classification, and anomaly detection.

The course includes premium content from three of our most popular books:

  • Java for Data Science
  • Machine Learning in Java
  • Mastering Java Machine Learning

On completion of this course, you will understand various machine learning techniques, the different Java machine learning algorithms you can use to gain data insights, how to build data models to analyze larger, complex datasets, and how to develop applications using Java and machine learning algorithms in the field of artificial intelligence.

Chapter 3. Data Cleaning

Real-world data is frequently dirty and unstructured, and must be reworked before it is usable. Data may contain errors, have duplicate entries, exist in the wrong format, or be inconsistent. The process of addressing these types of issues is called data cleaning. Data cleaning is also referred to as data wrangling, massaging, reshaping, or munging. Data merging, where data from multiple sources is combined, is often considered to be a data cleaning activity.

We need to clean data because any analysis based on inaccurate data can produce misleading results. We want to ensure that the data we work with is quality data. Data quality involves:

  • Validity: Ensuring that the data possesses the correct form or structure
  • Accuracy: The values within the data are truly representative of the dataset
  • Completeness: There are no missing elements
  • Consistency: Changes to data are in sync
  • Uniformity: The same units of measurement are used

There are several techniques and tools used to clean data. We will examine the following approaches:

  • Handling different types of data
  • Cleaning and manipulating text data
  • Filling in missing data
  • Validating data

In addition, we will briefly examine several image enhancement techniques.

There are often many ways to accomplish the same cleaning task. For example, there are a number of GUI tools that support data cleaning, such as OpenRefine (http://openrefine.org/). This tool allows a user to read in a dataset and clean it using a variety of techniques. However, it requires a user to interact with the application for each dataset that needs to be cleaned. It is not conducive to automation.

We will focus on how to clean data using Java code. Even then, there may be different techniques to clean the data. We will show multiple approaches to provide the reader with insights on how it can be done. Sometimes, this will use core Java string classes, and at other times, it may use specialized libraries.

These libraries are often more expressive and efficient. However, there are times when using a simple string function is more than adequate to address the problem. Showing complementary techniques will improve the reader's skill set.

The basic text-based tasks include:

  • Data transformation
  • Data imputation (handling missing data)
  • Subsetting data
  • Sorting data
  • Validating data
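As a rough illustration of how these tasks can fit together, the following sketch applies them to a small list of strings using only core Java. The class name, method names, sample values, and the "unknown" placeholder are all hypothetical choices made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class; names and the "unknown" placeholder are illustrative only.
class TextTasksSketch {

    // Transformation and imputation: trim and lowercase each value,
    // and substitute a placeholder for null or blank entries.
    static List<String> normalize(List<String> raw) {
        List<String> result = new ArrayList<>();
        for (String value : raw) {
            if (value == null || value.trim().isEmpty()) {
                result.add("unknown");
            } else {
                result.add(value.trim().toLowerCase());
            }
        }
        return result;
    }

    // Validating and subsetting: keep only values made up entirely of lowercase letters.
    static List<String> lettersOnly(List<String> values) {
        List<String> result = new ArrayList<>();
        for (String value : values) {
            if (value.matches("[a-z]+")) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("  Pear", "APPLE ", "", "fig7", null);
        List<String> cleaned = lettersOnly(normalize(raw));
        Collections.sort(cleaned);   // Sorting
        System.out.println(cleaned); // [apple, pear, unknown, unknown]
    }
}
```

Each step is deliberately kept in its own method so that a single task can be swapped out without touching the others.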

In this chapter, we are interested in cleaning data. However, part of this process is extracting information from various data sources. The data may be stored in plaintext or in binary form. We need to understand the various formats used to store data before we can begin the cleaning process. Many of these formats were introduced in Chapter 2, Data Acquisition, but we will go into greater detail in the following sections.

Handling data formats

Data comes in all types of forms. We will examine the more commonly used formats and show how they can be extracted from various data sources. Before we can clean data it needs to be extracted from a data source such as a file. In this section, we will build upon the introduction to data formats found in Chapter 2, Data Acquisition, and show how to extract all or part of a dataset. For example, from an HTML page we may want to extract only the text without markup. Or perhaps we are only interested in its figures.

These data formats can be quite complex. The intent of this section is to illustrate the basic techniques commonly used with that data format. Full treatment of a specific data format is beyond the scope of this book. Specifically, we will introduce how the following data formats can be processed from Java:

  • CSV data
  • Spreadsheets
  • Portable Document Format, or PDF files
  • JavaScript Object Notation, or JSON files

There are many other file types not addressed here. For example, jsoup is useful for parsing HTML documents. Since we introduced how this is done in the Web scraping in Java section of Chapter 2, Data Acquisition, we will not duplicate the effort here.

Handling CSV data

A common technique for separating information is to use commas or similar separators. Knowing how to work with CSV data allows us to utilize this type of data in our analysis efforts. When we deal with CSV data there are several issues including escaped data and embedded commas.

We will examine a few basic techniques for processing comma-separated data. Due to the row-column structure of CSV data, these techniques will read data from a file and place the data in a two-dimensional array. First, we will use a combination of the Scanner class to read in tokens and the String class split method to separate the data and store it in the array. Next, we will explore using the third-party library, OpenCSV, which offers a more efficient technique.

However, the first approach may only be appropriate for quick and dirty processing of data. We will discuss each of these techniques since they are useful in different situations.

We will use a dataset downloaded from https://www.data.gov/ containing U.S. demographic statistics sorted by ZIP code. This dataset can be downloaded at https://catalog.data.gov/dataset/demographic-statistics-by-zip-code-acfc9. For our purposes, this dataset has been stored in the file Demographics.csv. In this particular file, every row contains the same number of columns. However, not all data will be this clean and the solutions shown next take into account the possibility for jagged arrays.

Note

A jagged array is an array where the number of columns may be different for different rows. For example, row 2 may have 5 elements while row 3 may have 6 elements. When using jagged arrays you have to be careful with your column indexes.
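The following sketch shows one way to stay within bounds when working with a jagged array. The cellOrDefault helper and the sample values are hypothetical and exist only for this illustration.

```java
// Hypothetical helper class illustrating safe access to a jagged array.
class JaggedArraySketch {

    // Returns the cell at (row, col), or the fallback when the row or column does not exist.
    static String cellOrDefault(String[][] data, int row, int col, String fallback) {
        if (row < 0 || row >= data.length) {
            return fallback;
        }
        String[] cells = data[row];
        if (cells == null || col < 0 || col >= cells.length) {
            return fallback;
        }
        return cells[col];
    }

    public static void main(String[] args) {
        String[][] jagged = {
            {"10001", "44", "22"},
            {"10002", "35"}          // one column short
        };
        for (int r = 0; r < jagged.length; r++) {
            // Use each row's own length instead of assuming a fixed column count.
            for (int c = 0; c < jagged[r].length; c++) {
                System.out.print(jagged[r][c] + " ");
            }
            System.out.println();
        }
        System.out.println(cellOrDefault(jagged, 1, 2, "n/a")); // n/a
    }
}
```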

First, we use the Scanner class to read in data from our data file. We will temporarily store the data in an ArrayList since we will not always know how many rows our data contains.

// Declared outside the try block so the list can be used afterwards
ArrayList<String> list = new ArrayList<String>();
try (Scanner csvData = new Scanner(new File("Demographics.csv"))) {
    while (csvData.hasNextLine()) {
        list.add(csvData.nextLine());
    }
} catch (FileNotFoundException ex) {
    // Handle exceptions
}

The list is converted to an array using the toArray method. This version of the method uses a String array as an argument so that the method will know what type of array to create. A two-dimensional array is then created to hold the CSV data.

String[] tempArray = list.toArray(new String[0]); 
String[][] csvArray = new String[tempArray.length][]; 

The split method is used to create an array of Strings for each row. This array is assigned to a row of the csvArray.

for(int i=0; i<tempArray.length; i++) { 
    csvArray[i] = tempArray[i].split(","); 
} 
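The steps above can be combined into one self-contained sketch. To keep it runnable without a data file, it reads CSV text from a String rather than from Demographics.csv; the class name and sample data are hypothetical, and the -1 limit passed to split is an addition that keeps trailing empty fields. The last row illustrates why this approach suits only quick processing: a quoted value containing a comma is split incorrectly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical class combining the Scanner and split steps described above.
class CsvSplitSketch {

    // Reads CSV text line by line and splits each line on commas.
    static String[][] parse(String csvText) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(csvText)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        String[][] rows = new String[lines.size()][];
        for (int i = 0; i < rows.length; i++) {
            // A limit of -1 keeps trailing empty fields such as the last one in "a,b,".
            rows[i] = lines.get(i).split(",", -1);
        }
        return rows;
    }

    public static void main(String[] args) {
        String sample = "zip,count\n10001,44\n10002,\"1,200\"";
        String[][] rows = parse(sample);
        System.out.println(rows[1].length); // 2
        // The quoted value "1,200" is wrongly split into two tokens.
        System.out.println(rows[2].length); // 3
    }
}
```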

Our next technique will use a third-party library to read in and process CSV data. There are multiple options available, but we will focus on the popular OpenCSV (http://opencsv.sourceforge.net). This library offers several advantages over our previous technique. We can have an arbitrary number of items on each row without worrying about handling exceptions. We also do not need to worry about embedded commas or embedded carriage returns within the data tokens. The library also allows us to choose between reading the entire file at once or using an iterator to process data line-by-line.

First, we need to create an instance of the CSVReader class. Notice the second parameter allows us to specify the delimiter, a useful feature if we have a similar file format delimited by tabs or dashes, for example. If we want to read the entire file at one time, we use the readAll method, which returns a list of String arrays, one per row.

CSVReader dataReader = new CSVReader(new FileReader("Demographics.csv"), ','); 
List<String[]> holdData = dataReader.readAll();

Each element of the list returned by readAll is a String array holding the tokens of one row, so the data is already split and can be copied into a two-dimensional array if needed. Alternatively, we can process the data one line at a time. In the example that follows, each token is printed out individually, but the tokens can also be stored in a two-dimensional array or other data structure as appropriate.

CSVReader dataReader = new CSVReader(new FileReader("Demographics.csv"), ','); 
String[] nextLine; 
while ((nextLine = dataReader.readNext()) != null) { 
    for (String token : nextLine) { 
        out.println(token); 
    } 
} 
dataReader.close(); 

We can now clean or otherwise process the array.

Handling spreadsheets

Spreadsheets have proven to be a very popular tool for processing numeric and textual data. Due to the wealth of information that has been stored in spreadsheets over the past decades, knowing how to extract information from spreadsheets enables us to take advantage of this widely available data source. In this section, we will demonstrate how this is accomplished using the Apache POI API.

Open Office also supports a spreadsheet application. Open Office documents are stored in XML format which makes it readily accessible using XML parsing technologies. However, the Apache ODF Toolkit (http://incubator.apache.org/odftoolkit/) provides a means of accessing data within a document without knowing the format of the OpenOffice document. This is currently an incubator project and is not fully mature. There are a number of other APIs that can assist in processing OpenOffice documents as detailed on the Open Document Format (ODF) for developers (http://www.opendocumentformat.org/developers/) page.

Handling Excel spreadsheets

Apache POI (http://poi.apache.org/index.html) is a set of APIs providing access to many Microsoft products including Excel and Word. It consists of a series of components designed to access a specific Microsoft product. An overview of these components is found at http://poi.apache.org/overview.html.

In this section we will demonstrate how to read a simple Excel spreadsheet using the XSSF component to access Excel 2007+ spreadsheets. The Javadocs for the Apache POI API are found at https://poi.apache.org/apidocs/index.html.

We will use a simple Excel spreadsheet consisting of a series of rows containing an ID along with minimum, maximum, and average values. These numbers are not intended to represent any specific type of data. The spreadsheet follows:

ID      Minimum  Maximum  Average
12345   45       89       65.55
23456   78       96       86.75
34567   56       89       67.44
45678   86       99       95.67

We start with a try-with-resources block to handle any IOExceptions that may occur:

try (FileInputStream file = new FileInputStream( 
        new File("Sample.xlsx"))) { 
    ... 
} catch (IOException e) { 
    // Handle exceptions 
} 

An instance of the XSSFWorkbook class is created using the spreadsheet. Since a workbook may consist of multiple spreadsheets, we select the first one using the getSheetAt method.

XSSFWorkbook workbook = new XSSFWorkbook(file); 
XSSFSheet sheet = workbook.getSheetAt(0); 

The next step is to iterate through the rows, and then each column, of the spreadsheet:

for (Row row : sheet) { 
    for (Cell cell : row) { 
        ... 
    } 
    out.println(); 
} 

Each cell of the spreadsheet may use a different format. We use the getCellType method to determine its type and then use the appropriate method to extract the data in the cell. In this example we are only working with numeric and text data.

switch (cell.getCellType()) { 
    case Cell.CELL_TYPE_NUMERIC: 
        out.print(cell.getNumericCellValue() + "\t"); 
        break; 
    case Cell.CELL_TYPE_STRING: 
        out.print(cell.getStringCellValue() + "\t"); 
        break; 
   } 

When executed we get the following output:

ID Minimum Maximum Average 
12345.0 45.0 89.0 65.55
23456.0 78.0 96.0 86.75
34567.0 56.0 89.0 67.44
45678.0 86.0 99.0 95.67
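The IDs appear as 12345.0 because getNumericCellValue returns a double. As a minimal sketch of one possible cleanup step, the hypothetical helper below prints whole numbers without the trailing .0 and leaves other values unchanged. It uses only core Java and assumes the whole-number values fit in a long.

```java
// Hypothetical helper for printing numeric cell values; not part of the POI API.
class CellFormatSketch {

    // Whole numbers are rendered without a fractional part; everything else is left as-is.
    static String format(double value) {
        if (value == Math.rint(value) && !Double.isInfinite(value)) {
            return String.valueOf((long) value);
        }
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(format(12345.0)); // 12345
        System.out.println(format(65.55));   // 65.55
    }
}
```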

POI supports other more sophisticated classes and methods to extract data.

Handling PDF files

There are several APIs supporting the extraction of text from a PDF file. Here we will use PDFBox. Apache PDFBox (https://pdfbox.apache.org/) is an open source API that allows Java programmers to work with PDF documents. In this section we will illustrate how to extract simple text from a PDF document. Javadocs for the PDFBox API are found at https://pdfbox.apache.org/docs/2.0.1/javadocs/.

The sample PDF document used here contains the following text:

This is a simple PDF file. It consists of several bullets:

  • Line 1
  • Line 2
  • Line 3

This is the end of the document.

A try block is used to catch IOExceptions. The PDDocument class will represent the PDF document being processed. Its load method will load in the PDF file specified by the File object:

try { 
    PDDocument document = PDDocument.load(new File("PDF File.pdf")); 
    ... 
} catch (Exception e) { 
    // Handle exceptions 
} 

Once loaded, the PDFTextStripper class getText method will extract the text from the file. The text is then displayed as shown here:

PDFTextStripper textStripper = new PDFTextStripper(); 
String documentText = textStripper.getText(document); 
System.out.println(documentText); 

The output of this example follows. Notice that the bullets are returned as question marks.

This is a simple PDF file. It consists of several bullets: 
? Line 1 
? Line 2 
? Line 3 
This is the end of the document.
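Extracted text like this usually needs cleaning before it is analyzed. The following sketch replaces a question mark at the start of a line with a dash, using only core Java. It assumes that a leading question mark followed by a space always stands for a lost bullet, which may not hold for other documents; the class and method names are hypothetical.

```java
// Hypothetical cleanup step for text extracted from a PDF.
class BulletCleanupSketch {

    // (?m) makes ^ match at the start of every line, so only a leading
    // question mark followed by spaces or tabs is replaced.
    static String replaceBullets(String text) {
        return text.replaceAll("(?m)^\\?[ \\t]+", "- ");
    }

    public static void main(String[] args) {
        String extracted = "It consists of several bullets:\n? Line 1\n? Line 2\nIs this the end? Yes.";
        // Question marks inside a line are left untouched.
        System.out.println(replaceBullets(extracted));
    }
}
```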

This is a brief introduction to the use of PDFBox. It is a very powerful tool when we need to extract and otherwise manipulate PDF documents.

Handling JSON

In Chapter 2, Data Acquisition, we learned that certain YouTube searches return JSON-formatted results. Specifically, the SearchResult class holds information relating to a specific search. There, we illustrated how to use YouTube-specific techniques to extract information. In this section we will illustrate how to extract JSON information using the Jackson JSON implementation.

Jackson supports three models for processing JSON data:

  • Streaming API - JSON data is processed token by token
  • Tree model - The JSON data is held entirely in memory and then processed
  • Data binding - The JSON data is transformed to a Java object

Using JSON streaming API

We will illustrate the first two approaches. The first approach is more efficient and is used when a large amount of data is processed. The second technique is convenient, but the data must not be too large. The third technique is useful when it is more convenient to use specific Java classes to process data. For example, if the JSON data represents an address, then a specific Java address class can be defined to hold and process the data.

There are several Java libraries that support JSON processing. We will use the Jackson Project (https://github.com/FasterXML/jackson). Documentation is found at https://github.com/FasterXML/jackson-docs. We will use two JSON files to demonstrate how it can be used. The first file, Person.json, is shown next; it stores data for a single person. It consists of four fields, where the last field is an array of location information.

{  
   "firstname":"Smith", 
   "lastname":"Peter",  
   "phone":8475552222, 
   "address":["100 Main Street","Corpus","Oklahoma"]  
} 

The code sequence that follows shows how to extract the values for each of the fields. Within the try-catch block a JsonFactory instance is created which then creates a JsonParser instance based on the Person.json file.

try { 
    JsonFactory jsonfactory = new JsonFactory(); 
    JsonParser parser = jsonfactory.createParser(new File("Person.json")); 
    ... 
    parser.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

The JsonParser object keeps track of the current token. In the while loop, the nextToken method advances the parser to the next token and returns it. The getCurrentName method returns the field name associated with the current token. The while loop terminates when the END_OBJECT token that closes the JSON object is reached.

while (parser.nextToken() != JsonToken.END_OBJECT) { 
    String token = parser.getCurrentName(); 
    ... 
} 

The body of the loop consists of a series of if statements that process each field by its name. Since the address field is an array, another loop will extract each of its elements until the ending array token is reached.

if ("firstname".equals(token)) { 
    parser.nextToken(); 
    String fname = parser.getText(); 
    out.println("firstname : " + fname); 
} 
if ("lastname".equals(token)) { 
    parser.nextToken(); 
    String lname = parser.getText(); 
    out.println("lastname : " + lname); 
} 
if ("phone".equals(token)) { 
    parser.nextToken(); 
    long phone = parser.getLongValue(); 
    out.println("phone : " + phone); 
} 
if ("address".equals(token)) { 
    out.println("address :"); 
    parser.nextToken(); 
    while (parser.nextToken() != JsonToken.END_ARRAY) { 
        out.println(parser.getText()); 
    } 
} 

The output of this example follows:

firstname : Smith
lastname : Peter
phone : 8475552222
address :
100 Main Street
Corpus
Oklahoma

However, JSON objects are frequently more complex than the previous example. Here a Persons.json file consists of an array of three persons:

{ 
   "persons": { 
      "groupname": "school", 
      "person": 
         [  
            {"firstname":"Smith", 
              "lastname":"Peter",  
              "phone":8475552222, 
              "address":["100 Main Street","Corpus","Oklahoma"] }, 
           {"firstname":"King", 
              "lastname":"Sarah",  
              "phone":8475551111, 
              "address":["200 Main Street","Corpus","Oklahoma"] }, 
           {"firstname":"Frost", 
              "lastname":"Nathan",  
              "phone":8475553333, 
              "address":["300 Main Street","Corpus","Oklahoma"] } 
         ] 
   } 
} 

To process this file, we use a similar set of code as shown previously. We create the parser and then enter a loop as before:

try { 
    JsonFactory jsonfactory = new JsonFactory(); 
    JsonParser parser = jsonfactory.createParser(new File("Persons.json")); 
    while (parser.nextToken() != JsonToken.END_OBJECT) { 
        String token = parser.getCurrentName(); 
        ... 
    } 
    parser.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

However, we need to find the persons field and then extract each of its elements. The groupname field is extracted and displayed as shown here:

if ("persons".equals(token)) { 
    JsonToken jsonToken = parser.nextToken(); 
    jsonToken = parser.nextToken(); 
    token = parser.getCurrentName(); 
    if ("groupname".equals(token)) { 
        parser.nextToken(); 
        String groupname = parser.getText(); 
        out.println("Group : " + groupname); 
        ... 
    } 
} 

Next, we find the person field and call a parsePerson method to better organize the code:

parser.nextToken(); 
token = parser.getCurrentName(); 
if ("person".equals(token)) { 
    out.println("Found person"); 
    parsePerson(parser); 
} 

The parsePerson method follows. It is very similar to the process used in the first example.

public void parsePerson(JsonParser parser) throws IOException { 
    while (parser.nextToken() != JsonToken.END_ARRAY) { 
        String token = parser.getCurrentName(); 
        if ("firstname".equals(token)) { 
            parser.nextToken(); 
            String fname = parser.getText(); 
            out.println("firstname : " + fname); 
        } 
        if ("lastname".equals(token)) { 
            parser.nextToken(); 
            String lname = parser.getText(); 
            out.println("lastname : " + lname); 
        } 
        if ("phone".equals(token)) { 
            parser.nextToken(); 
            long phone = parser.getLongValue(); 
            out.println("phone : " + phone); 
        } 
        if ("address".equals(token)) { 
            out.println("address :"); 
            parser.nextToken(); 
            while (parser.nextToken() != JsonToken.END_ARRAY) { 
                out.println(parser.getText()); 
            } 
        } 
    } 
} 

The output follows:

Group : school
Found person
firstname : Smith
lastname : Peter
phone : 8475552222
address :
100 Main Street
Corpus
Oklahoma
firstname : King
lastname : Sarah
phone : 8475551111
address :
200 Main Street
Corpus
Oklahoma
firstname : Frost
lastname : Nathan
phone : 8475553333
address :
300 Main Street
Corpus
Oklahoma

Using the JSON tree API

The second approach is to use the tree model. An ObjectMapper instance is used to create a JsonNode instance using the Persons.json file. The fieldNames method returns an Iterator, allowing us to process each element of the file.

try { 
    ObjectMapper mapper = new ObjectMapper(); 
    JsonNode node = mapper.readTree(new File("Persons.json")); 
    Iterator<String> fieldNames = node.fieldNames(); 
    while (fieldNames.hasNext()) { 
        ... 
        fieldNames.next(); 
    } 
} catch (IOException ex) { 
    // Handle exceptions 
} 

Since the JSON file contains a persons field, we will obtain a JsonNode instance representing the field and then iterate over each of its elements.

JsonNode personsNode = node.get("persons"); 
Iterator<JsonNode> elements = personsNode.iterator(); 
while (elements.hasNext()) { 
    ... 
} 

Each element is processed one at a time. If the element type is a string, we assume that this is the groupname field.

JsonNode element = elements.next(); 
JsonNodeType nodeType = element.getNodeType(); 
 
if (nodeType == JsonNodeType.STRING) { 
    out.println("Group: " + element.textValue()); 
} 

If the element is an array, we assume it contains a series of persons where each person is processed by the parsePerson method:

if (nodeType == JsonNodeType.ARRAY) { 
    Iterator<JsonNode> fields = element.iterator(); 
    while (fields.hasNext()) { 
        parsePerson(fields.next()); 
    } 
}

The parsePerson method is shown next:

public void parsePerson(JsonNode node) { 
    Iterator<JsonNode> fields = node.iterator(); 
    while(fields.hasNext()) { 
        JsonNode subNode = fields.next(); 
        out.println(subNode.asText()); 
    } 
}

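The output follows. Note that asText returns an empty string for container nodes such as the address array, so the addresses contribute no text to the output: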

Group: school
Smith
Peter
8475552222
King
Sarah
8475551111
Frost
Nathan
8475553333

There is much more to JSON than we are able to illustrate here. However, this should give you an idea of how this type of data can be handled.

The nitty gritty of cleaning text

Strings are used to support text processing so using a good string library is important. Unfortunately, the java.lang.String class has some limitations. To address these limitations, you can either implement your own special string functions as needed or you can use a third-party library.

Creating your own library can be useful, but you will basically be reinventing the wheel. It may be faster to write a simple code sequence to implement some functionality, but to do things right, you will need to test them. Third-party libraries have already been tested and have been used on hundreds of projects. They provide a more efficient way of processing text.

There are several text processing APIs in addition to those found in Java. We will demonstrate two of these, Apache Commons and Google Guava, in the Third-party tokenizers and libraries section.

Java provides considerable support for cleaning text data, including methods in the String class. These methods are ideal for simple text cleaning and small amounts of data but can also be efficient with larger, complex datasets. We will demonstrate several String class methods in a moment. Some of the most helpful String class methods are summarized in the following table:

Method Name                      Return Type  Description
trim                             String       Removes leading and trailing whitespace
toUpperCase/toLowerCase          String       Changes the casing of the entire string
replaceAll                       String       Replaces every substring that matches a regular expression
contains                         boolean      Determines whether a given character sequence exists within the string
compareTo/compareToIgnoreCase    int          Compares two strings lexicographically and returns an integer representing their relationship
matches                          boolean      Determines whether the string matches a given regular expression
join                             String       Combines two or more strings with a specified delimiter
split                            String[]     Separates elements of a given string using a specified delimiter
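A short sketch exercising the methods in the table may help make their behavior concrete. The class name, the tidy helper, and the sample strings are hypothetical and chosen only for illustration.

```java
// Hypothetical demonstration of the String methods summarized above.
class StringMethodsSketch {

    // Trims, lowercases, and collapses runs of whitespace into a single space.
    static String tidy(String raw) {
        return raw.trim().toLowerCase().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String tidy = tidy("  Call   me ISHMAEL.  ");
        System.out.println(tidy);                                    // call me ishmael.
        System.out.println(tidy.contains("ishmael"));                // true
        System.out.println("apple".compareTo("banana") < 0);         // true
        System.out.println("Apple".compareToIgnoreCase("apple"));    // 0
        System.out.println("12345".matches("\\d{5}"));               // true
        System.out.println(String.join("-", "a", "b", "c"));         // a-b-c
        System.out.println("a,b,c".split(",").length);               // 3
    }
}
```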

Many text operations are simplified by the use of regular expressions. Regular expressions use standardized syntax to represent patterns in text, which can be used to locate and manipulate text matching the pattern.

A regular expression is itself simply a string. For example, the string Hello, my name is Sally can be used as a regular expression to find those exact words within a given text. This is very specific and not broadly applicable, but we can use a different regular expression to make our code more effective. Hello, my name is \\w will match any occurrence of Hello, my name is followed by a word character.

We will use several examples of more complex regular expressions, and some of the more useful syntax options are summarized in the following table. Note that the backslash in each option must itself be escaped when it appears in a Java string literal, so \d is written as \\d.

Option  Description
\d      Any digit: 0-9
\D      Any non-digit
\s      Any whitespace character
\S      Any non-whitespace character
\w      Any word character (including digits): A-Z, a-z, 0-9, and the underscore
\W      Any non-word character
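The sketch below shows a few of these options in use with the java.util.regex classes and the String class. The class name, method names, and sample strings are hypothetical. Note the doubled backslashes required in Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demonstration of the predefined character classes.
class RegexClassesSketch {

    // Captures the run of word characters that follows the fixed greeting, or returns null.
    static String firstName(String text) {
        Matcher matcher = Pattern.compile("Hello, my name is (\\w+)").matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Removes every character that is not a digit.
    static String digitsOnly(String text) {
        return text.replaceAll("\\D", "");
    }

    public static void main(String[] args) {
        System.out.println(firstName("Hello, my name is Sally."));  // Sally
        System.out.println(digitsOnly("(847) 555-2222"));           // 8475552222
        System.out.println("a b\tc".split("\\s").length);           // 3
    }
}
```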

The size and source of text data vary widely from application to application, but the methods used to transform the data remain the same. You may actually need to read data from a file, but for simplicity's sake, we will be using a string containing the beginning sentences of Herman Melville's Moby Dick for several examples within this chapter. Unless otherwise specified, the text will be assumed to be as shown next:

String dirtyText = "Call me Ishmael. Some years ago- never mind how"; 
dirtyText += " long precisely - having little or no money in my purse,"; 
dirtyText += " and nothing particular to interest me on shore, I thought";  
dirtyText += " I would sail about a little and see the watery part of the world."; 

Using Java tokenizers to extract words

Often it is most efficient to analyze text data as tokens. There are multiple tokenizers available in the core Java libraries as well as third-party tokenizers. We will demonstrate various tokenizers throughout this chapter. The ideal tokenizer will depend upon the limitations and requirements of an individual application.

Java core tokenizers

StringTokenizer was the first and most basic tokenizer and has been available since Java 1. It is a legacy class whose use is discouraged in new development; the String class's split method or the java.util.regex package is recommended instead. While it can provide a speed advantage for files with narrowly defined and set delimiters, it is less flexible than other tokenizer options. The following is a simple implementation of the StringTokenizer class that splits a string on spaces:

StringTokenizer tokenizer = new StringTokenizer(dirtyText," "); 
while(tokenizer.hasMoreTokens()){ 
  out.print(tokenizer.nextToken() + " "); 
} 

When we set the dirtyText variable to hold our text from Moby Dick, shown previously, we get the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely...

StreamTokenizer is another core Java tokenizer. StreamTokenizer grants more information about the tokens retrieved, and allows the user to specify data types to parse, but is considered more difficult to use than StringTokenizer or the split method. The String class split method is the simplest way to split strings up based on a delimiter, but it does not provide a way to parse the split strings, and it accepts only a single delimiter pattern, expressed as a regular expression, for the entire string. For these reasons, it is not a true tokenizer, but it can be useful for data cleaning.
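To give a sense of the extra token information StreamTokenizer provides, the following sketch labels each token as a word, a number, or an ordinary character using the tokenizer's default settings. The class name, the describe method, and the label format are hypothetical.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demonstration of StreamTokenizer with its default settings.
class StreamTokenizerSketch {

    // Labels each token by the type reported in the tokenizer's ttype field.
    static List<String> describe(String text) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        List<String> result = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                switch (tokenizer.ttype) {
                    case StreamTokenizer.TT_WORD:
                        result.add("WORD:" + tokenizer.sval);   // sval holds the word text
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        result.add("NUMBER:" + tokenizer.nval); // nval holds the value as a double
                        break;
                    default:
                        result.add("CHAR:" + (char) tokenizer.ttype);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [WORD:Ishmael, WORD:paid, NUMBER:12.0, WORD:dollars]
        System.out.println(describe("Ishmael paid 12 dollars"));
    }
}
```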

The Scanner class is designed to allow you to parse strings into different data types. We used it previously in the Handling CSV data section and we will address it again in the Removing stop words section.
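As a small sketch of this type-aware parsing, the hypothetical method below sums the numeric tokens in a string and skips everything else. The locale is fixed to Locale.US so that the decimal point is interpreted the same way on every machine; that choice is an assumption made for this example.

```java
import java.util.Locale;
import java.util.Scanner;

// Hypothetical demonstration of parsing typed values with Scanner.
class ScannerTypesSketch {

    // Adds up every token that can be read as a double and ignores the rest.
    static double sumNumbers(String text) {
        double sum = 0;
        try (Scanner scanner = new Scanner(text)) {
            scanner.useLocale(Locale.US);
            while (scanner.hasNext()) {
                if (scanner.hasNextDouble()) {
                    sum += scanner.nextDouble();
                } else {
                    scanner.next(); // skip a non-numeric token
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumNumbers("min 45 max 89 avg 65.5")); // 199.5
    }
}
```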

Third-party tokenizers and libraries

Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer. This class provides more advanced support than the standard StringTokenizer class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer:

StrTokenizer tokenizer = new StrTokenizer(text); 
while (tokenizer.hasNext()) { 
  out.print(tokenizer.next() + " "); 
} 

This operates in a similar fashion to StringTokenizer and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.

When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...

We can modify our constructor as follows:

StrTokenizer tokenizer = new StrTokenizer(text,","); 

The output for this implementation is:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse
and nothing particular to interest me on shore
I thought I would sail about a little and see the watery part of the world.

Notice how each line is split where commas existed in the original text. This delimiter can be a simple string, as we have shown, a single char, or a more complex StrMatcher object.

Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner class and the Splitter class. Tokenization is accomplished in Guava using its Splitter class's split method. The following is a simple example:

Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults(); 
Iterable<String> words = simpleSplit.split(dirtyText);  
for(String token: words){ 
  out.print(token); 
} 

This splits each token on commas and produces output like our last example. We can modify the parameter of the on method to split on the character of our choosing. Notice the method chaining which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.

LingPipe is a linguistic toolkit available for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory interface. We use a LingPipe IndoEuropeanTokenizerFactory tokenizer in the Removing stop words section.

Transforming data into a usable form

Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyse.

Simple text cleaning

We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:

out.println(dirtyText); 
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " "); 
dirtyText = dirtyText.trim(); 
while(dirtyText.contains("  ")){ 
  dirtyText = dirtyText.replaceAll("  ", " "); 
} 
out.println(dirtyText);  

When executed, the code produces the following output, truncated:

Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely
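
The cleaning steps above can also be gathered into one reusable method. The following is a minimal sketch rather than a definitive implementation; the class name TextCleaner and the method name clean are our own, chosen for illustration. It collapses runs of whitespace with a single \\s+ replacement instead of a loop:

```java
import java.util.Locale;

class TextCleaner {
    // Lowercase the text, replace digits and punctuation with spaces,
    // collapse runs of whitespace, and trim the ends.
    static String clean(String dirtyText) {
        return dirtyText
                .toLowerCase(Locale.ROOT)
                .replaceAll("[\\d[^\\w\\s]]+", " ") // digits and non-word, non-space characters
                .replaceAll("\\s+", " ")            // any run of whitespace becomes one space
                .trim();
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago- never mind how long precisely -";
        System.out.println(clean(text));
        // prints: call me ishmael some years ago never mind how long precisely
    }
}
```

Because \\s+ matches tabs and line breaks as well as spaces, this version also normalizes text that spans several lines.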

Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression [\\W]+, which matches one or more characters that are not word characters:

out.println(dirtyText); 
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); 
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); 
for(String clean : cleanText){ 
  out.print(clean + " ");
} 

This code produces the same output as shown previously.

Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:

out.println(dirtyText); 
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = String.join(" ", words); 
out.println(cleanText); 

Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:

out.println(dirtyText);  
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = Joiner.on(" ").skipNulls().join(words); 
out.println(cleanText); 

This version provides additional options, including skipping nulls, as shown before. The output remains the same.

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform, split it into words, and store it in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" "))); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList<>(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase().trim(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
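
When the stop words are already in memory, the same filtering can be expressed with a Set and a stream. This is a sketch under stated assumptions: the class name StopWordFilter, the method name removeStopWords, and the short stop word list are hypothetical and stand in for whatever list your application loads:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordFilter {
    // Keep only the words that are not in the stop word set, preserving order.
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        return words.stream()
                .filter(word -> !stopWords.contains(word))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical stop word list, for illustration only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("me", "some", "the", "and", "a", "i"));
        List<String> words = Arrays.asList("call me ishmael some years ago".split(" "));
        System.out.println(removeStopWords(words, stopWords));
        // prints: [call, ishmael, years, ago]
    }
}
```

A HashSet lookup takes roughly constant time, so this approach stays fast even when the stop word list is long.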

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we obtain an instance of a class that implements the TokenizerFactory interface. We wrap our factory so that it filters the default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin within the array, and the last parameter specifies how many characters to process:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences from our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
      out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.
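
Splitting on spaces misses a word that is followed by punctuation, such as me at the end of a sentence. One possible alternative, sketched here with the standard Pattern and Matcher classes, counts whole-word matches using \\b word boundaries. The class name WordCounter and the method name countWord are our own:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCounter {
    // Count whole-word, case-insensitive occurrences of toFind in text.
    static int countWord(String text, String toFind) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(toFind.trim()) + "\\b", // quote() treats toFind literally
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago- never mind how long precisely - "
                + "having little or no money in my purse, and nothing particular to "
                + "interest me on shore, I thought I would sail about a little and see "
                + "the watery part of the world.";
        System.out.println("Found I " + countWord(text, "I") + " times in the text.");
        // prints: Found I 2 times in the text.
    }
}
```

Because the match is case-insensitive and works on the raw text, no prior call to toLowerCase is needed.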

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 0)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file as long as the next line is not null, which indicates the end of the file:

String path = "C://MobyDick.txt"; 
try { 
    String textLine = ""; 
    int line = 0; // number of lines read so far 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
        } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...
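
If we also want to know where a match occurred, we can record the line numbers as we read. The following sketch assumes a hypothetical helper named findLines in a class named LineSearch; it accepts any Reader so that it can be tried on a String without touching the file system, and it uses try-with-resources so the reader is closed even when an exception occurs:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LineSearch {
    // Return the 1-based numbers of the lines that contain toFind, ignoring case.
    static List<Integer> findLines(Reader source, String toFind) {
        String needle = toFind.toLowerCase(Locale.ROOT).trim();
        List<Integer> matches = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String textLine;
            int line = 0;
            while ((textLine = reader.readLine()) != null) {
                line++;
                if (textLine.toLowerCase(Locale.ROOT).contains(needle)) {
                    matches.add(line);
                }
            }
        } catch (IOException ex) {
            // Rethrow unchecked so callers are not forced to catch it.
            throw new UncheckedIOException(ex);
        }
        return matches;
    }

    public static void main(String[] args) {
        String sample = "Call me Ishmael.\nSome years ago\nnever mind how long";
        System.out.println(findLines(new StringReader(sample), "i"));
        // prints: [1, 3]
    }
}
```

To search a file instead, pass a FileReader for the path in place of the StringReader.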

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string. Because replaceAll treats its first argument as a regular expression, we surround toFind with \\b word boundaries so that only whole words are replaced:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      text = text.replaceAll("\\b" + toFind + "\\b", replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me, although this will miss any instances where me is followed by punctuation, such as at the end of a sentence. We will examine a better alternative using Google Guava in a moment.

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by whitespace, such as the period after Ishmael or the hyphen after ago, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespaces from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' '); // CharMatcher.WHITESPACE in older Guava releases 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call Ishmael. So years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   sum = 0; // reset the running total from the previous calculation 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, it lowers the average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
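
The approach just described can be sketched as follows. We assume here that missing readings are marked with Double.NaN rather than zero, so that a genuine zero is never mistaken for a gap; the class name MeanImputer and the method name imputeWithMean are hypothetical:

```java
import java.util.Arrays;

class MeanImputer {
    // Replace each NaN entry with the mean of the values that are present.
    static double[] imputeWithMean(double[] values) {
        double mean = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .average()
                .orElse(Double.NaN); // nothing to average if every value is missing
        return Arrays.stream(values)
                .map(v -> Double.isNaN(v) ? mean : v)
                .toArray();
    }

    public static void main(String[] args) {
        // Assumption: the first month's reading is missing and marked as NaN.
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList);
        System.out.println("Imputed value for the first month: " + filled[0]);
    }
}
```

With the eleven remaining temperatures, the substituted value is 794 divided by 11, or roughly 72.18, which keeps the overall average close to the original data instead of dragging it toward zero.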

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also created an instance of the Optional class called tempName. We will use this to test whether a value in the array is null or not. We then loop through our array and call the Optional class ofNullable method on each element. This method wraps the value in an Optional, which is empty if the value is null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
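
As one illustration of those other methods, the map and filter methods can be chained before orElse. The sketch below, with the hypothetical names NameDefaults and nameOrDefault, trims each name and treats a blank name the same way as a null one:

```java
import java.util.Optional;

class NameDefaults {
    // Trim the name and fall back to a default when it is null or blank.
    static String nameOrDefault(String name) {
        return Optional.ofNullable(name)
                .map(String::trim)          // applied only when a value is present
                .filter(s -> !s.isEmpty())  // an empty result becomes an empty Optional
                .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", null, "  ", " Betsy "};
        for (String name : nameList) {
            System.out.println("Name to use = " + nameOrDefault(name));
        }
    }
}
```

Here the second and third elements both produce DEFAULT, while the last is printed as Betsy without its surrounding spaces.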

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then create a new TreeSet object to hold the subset retrieved from the list. Next, we print out our original list:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(Arrays.asList(nums)); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve a subset of all numbers from the first number in our set, 12, up to but not including 46:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());       

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4
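
TreeSet also implements the NavigableSet interface, whose four-argument subSet method lets us state explicitly whether each bound is included. The following sketch, using the hypothetical names RangeViews and closedRange, shows an inclusive range alongside the related headSet and tailSet methods:

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class RangeViews {
    // Return the elements from low to high, including both bounds.
    static NavigableSet<Integer> closedRange(TreeSet<Integer> set, int low, int high) {
        return set.subSet(low, true, high, true);
    }

    public static void main(String[] args) {
        TreeSet<Integer> nums = new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
        System.out.println(closedRange(nums, 12, 46)); // prints: [12, 14, 34, 44, 46]
        System.out.println(nums.headSet(34));          // prints: [12, 14]
        System.out.println(nums.tailSet(52));          // prints: [52, 87, 123]
    }
}
```

Note that these methods return views backed by the original set, so changes to the set are reflected in the view. Copy the result into a new TreeSet if an independent subset is needed.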

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, this time held in a List named numsList, and specify how many elements to skip with the skip method. We will also use the collect method, with a static import of Collectors.toCollection, to create a new Set to hold the remaining elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
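
Header lines can be dropped in the same pipeline by adding a call to skip. The sketch below is one way to do this; the names LineFilter and dataLines are our own, and the method takes a Reader so it can be demonstrated on an in-memory string:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

class LineFilter {
    // Skip the first headerLines lines, then drop lines that are empty or only whitespace.
    static List<String> dataLines(Reader source, int headerLines) {
        try (BufferedReader br = new BufferedReader(source)) {
            return br.lines()
                    .skip(headerLines)
                    .filter(s -> !s.trim().isEmpty())
                    .collect(Collectors.toList());
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        String sample = "name,age\n\nZoey,8\n   \nRoxie,10\n";
        System.out.println(dataLines(new StringReader(sample), 1));
        // prints: [Zoey,8, Roxie,10]
    }
}
```

The skip call counts raw lines, so it should come before the filter when the header is always at the top of the file. Testing s.trim().isEmpty() also removes lines that contain only spaces or tabs, which a comparison with the empty string would keep.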

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now pass this comparator to the sort method of the Collections class, using the numsList initialized in the next example:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List instances and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. This can be used wherever a lambda expression is used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each dog:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
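
The text assumes a Dogs class without showing it. The following sketch supplies a minimal stand-in, which is our assumption about its shape rather than the original class, and adds one more variation: the reversed method is applied to the age comparator so that the oldest dogs come first, with ties still broken alphabetically by name. The names DogSortSketch, sampleDogs, and namesOldestFirst are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class DogSortSketch {
    // Minimal stand-in for the Dogs class assumed in the text.
    static class Dogs {
        private final String name;
        private final int age;
        Dogs(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
    }

    static List<Dogs> sampleDogs() {
        return Arrays.asList(new Dogs("Zoey", 8), new Dogs("Roxie", 10),
                new Dogs("Kylie", 7), new Dogs("Shorty", 14),
                new Dogs("Ginger", 7), new Dogs("Penny", 7));
    }

    // Sort a copy by age descending, then by name ascending, and return the names.
    static List<String> namesOldestFirst(List<Dogs> dogs) {
        List<Dogs> sorted = new ArrayList<>(dogs);
        sorted.sort(Comparator.comparingInt(Dogs::getAge).reversed()
                .thenComparing(Dogs::getName));
        return sorted.stream().map(Dogs::getName).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(namesOldestFirst(sampleDogs()));
        // prints: [Shorty, Roxie, Zoey, Ginger, Kylie, Penny]
    }
}
```

Calling reversed before thenComparing reverses only the age comparison. Calling it at the end of the chain would reverse the name ordering as well.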

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
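
The same pattern can be adapted for floating point data, as mentioned earlier. This sketch uses the hypothetical names NumberChecks and isValidDouble and returns a boolean rather than printing, which makes the check easier to reuse:

```java
class NumberChecks {
    // Return true when the text can be parsed by Double.parseDouble.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false; // parseDouble would throw NullPointerException here
        }
        try {
            Double.parseDouble(toValidate.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDouble("12.34"));   // prints: true
        System.out.println(isValidDouble("Ishmael")); // prints: false
    }
}
```

Keep in mind that parseDouble is fairly permissive: it also accepts forms such as 1e3 and NaN, so stricter rules may call for an additional check.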

Apache Commons Validator contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
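
One way to loosen the restriction without accepting arbitrary text is to try several acceptable formats in turn. The following is a minimal sketch using the java.time API rather than SimpleDateFormat; the class name FlexibleDateValidator and the list of accepted patterns are assumptions made for illustration. Note that java.time resolves a two-digit uu year into the range 2000 to 2099, so 82 is read as 2082, which may or may not suit your data:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class FlexibleDateValidator {
    // Hypothetical list of accepted patterns; adjust it to the data at hand
    private static final List<DateTimeFormatter> ACCEPTED = Arrays.asList(
        strict("MM/dd/uuuu"),
        strict("MM/dd/uu"));

    private static DateTimeFormatter strict(String pattern) {
        // STRICT rejects impossible dates such as 02/30/2020
        return DateTimeFormatter.ofPattern(pattern)
                                .withResolverStyle(ResolverStyle.STRICT);
    }

    // Returns the parsed date, or an empty Optional if no pattern matches
    static Optional<LocalDate> parse(String text) {
        for (DateTimeFormatter format : ACCEPTED) {
            try {
                return Optional.of(LocalDate.parse(text, format));
            } catch (DateTimeParseException e) {
                // Not this pattern; try the next one
            }
        }
        return Optional.empty();
    }
}
```

With this sketch, both 12/12/1982 and 12/12/82 are accepted, while Ishmael and 02/30/2020 are rejected.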

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      // Keep the compiled Pattern so it can be used to create the Matcher 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address

There is a standard Java library for validating e-mail addresses as well. In this example, we use the InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
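
Because postal code formats vary by country, one possible arrangement is to keep a compiled pattern per country and look it up at validation time. The sketch below is only illustrative: the class name PostalCodeValidator, the country keys, and the simplified Canadian pattern are assumptions, not an authoritative list of formats:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PostalCodeValidator {
    private static final Map<String, Pattern> PATTERNS = new HashMap<>();
    static {
        // Same United States pattern as in the validateZip method
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        // Simplified Canadian shape (letter-digit-letter digit-letter-digit); an assumption
        PATTERNS.put("CA", Pattern.compile("^[A-Za-z][0-9][A-Za-z][ -]?[0-9][A-Za-z][0-9]$"));
    }

    // Returns false for unknown countries as well as for non-matching codes
    static boolean isValid(String country, String code) {
        Pattern pattern = PATTERNS.get(country);
        return pattern != null && pattern.matcher(code).matches();
    }
}
```

Compiling each pattern once and reusing it avoids recompiling the regular expression on every call.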

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \\p{L} provides this flexibility. We also add \\s, a period, an apostrophe, a comma, and a hyphen to the character class to allow whitespace, periods, apostrophes, commas, and hyphens in our name fields. The hyphen is placed last so that it is read as a literal character rather than as a range operator. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s.',-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Using Java tokenizers to extract words

Often it is most efficient to analyze text data as tokens. There are multiple tokenizers available in the core Java libraries as well as third-party tokenizers. We will demonstrate various tokenizers throughout this chapter. The ideal tokenizer will depend upon the limitations and requirements of an individual application.

Java core tokenizers

StringTokenizer was the first and most basic tokenizer and has been available since Java 1. It is a legacy class that is not recommended for use in new development; the String class's split method is generally preferred. While StringTokenizer can provide a speed advantage for files with narrowly defined and set delimiters, it is less flexible than other tokenizer options. The following is a simple implementation of the StringTokenizer class that splits a string on spaces:

StringTokenizer tokenizer = new StringTokenizer(dirtyText," "); 
while(tokenizer.hasMoreTokens()){ 
  out.print(tokenizer.nextToken() + " "); 
} 

When we set the dirtyText variable to hold our text from Moby Dick, shown previously, we get the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely...

StreamTokenizer is another core Java tokenizer. StreamTokenizer grants more information about the tokens retrieved, and allows the user to specify data types to parse, but is considered more difficult to use than StringTokenizer or the split method. The String class split method is the simplest way to split strings up based on a delimiter, but it does not provide a way to parse the split strings, and the delimiter must be expressed as a single regular expression for the entire string. For these reasons, it is not a true tokenizer, but it can be useful for data cleaning.

The Scanner class is designed to allow you to parse strings into different data types. We used it previously in the Handling CSV data section and we will address it again in the Removing stop words section.
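
To illustrate the extra information StreamTokenizer exposes, the following sketch labels each token as a word, a number, or a single character. The class name StreamTokenizerDemo and the label strings are made up for this illustration, and the tokenizer is left at its default settings:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerDemo {
    // Returns one labeled entry per token found in the text
    static List<String> describeTokens(String text) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                switch (tokenizer.ttype) {
                    case StreamTokenizer.TT_WORD:
                        tokens.add("WORD:" + tokenizer.sval);   // word text is in sval
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        tokens.add("NUMBER:" + tokenizer.nval); // numbers are parsed as double
                        break;
                    default:
                        tokens.add("CHAR:" + (char) tokenizer.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }
}
```

Calling describeTokens("Ishmael sailed 3 ships") returns [WORD:Ishmael, WORD:sailed, NUMBER:3.0, WORD:ships].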

Third-party tokenizers and libraries

Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer. This class provides more advanced support than the standard StringTokenizer class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer:

StrTokenizer tokenizer = new StrTokenizer(text); 
while (tokenizer.hasNext()) { 
  out.print(tokenizer.next() + " "); 
} 

This operates in a similar fashion to StringTokenizer and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.

When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...

We can modify our constructor as follows:

StrTokenizer tokenizer = new StrTokenizer(text,","); 

The output for this implementation is:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse
and nothing particular to interest me on shore
I thought I would sail about a little and see the watery part of the world.

Notice how each line is split where commas existed in the original text. This delimiter can be a simple char, as we have shown, or a more complex StrMatcher object.

Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner class and the Splitter class. Tokenization is accomplished in Guava using its Splitter class's split method. The following is a simple example:

Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults(); 
Iterable<String> words = simpleSplit.split(dirtyText);  
for(String token: words){ 
  out.print(token); 
} 

This splits each token on commas and produces output like our last example. We can modify the parameter of the on method to split on the character of our choosing. Notice the method chaining which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.

LingPipe is a linguistical toolkit available for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory interface. We implement a LingPipe IndoEuropeanTokenizerFactory tokenizer in the Simple text cleaning section.

Transforming data into a usable form

Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyse.

Simple text cleaning

We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:

out.println(dirtyText); 
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " "); 
dirtyText = dirtyText.trim(); 
while(dirtyText.contains("  ")){ 
  dirtyText = dirtyText.replaceAll("  ", " "); 
} 
out.println(dirtyText);  

When executed, the code produces the following output, truncated:

Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely

Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W, which represents anything that is not a word character:

out.println(dirtyText); 
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); 
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); 
for(String clean : cleanText){ 
  out.print(clean + " ");
} 

This code produces the same output as shown previously.

Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:

out.println(dirtyText); 
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = String.join(" ", words); 
out.println(cleanText); 

Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:

out.println(dirtyText);  
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = Joiner.on(" ").skipNulls().join(words); 
out.println(cleanText); 

This version provides additional options, including skipping nulls, as shown before. The output remains the same.

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
// Split the cleaned text on spaces so that each word becomes a list element 
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" "))); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase().trim(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
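
When the stop word list is large, it can help to hold it in a Set and filter the words with a stream, since a set lookup does not scan the whole list. The following is a sketch only: the class name StopWordFilter is hypothetical, and the stop words are passed in as a parameter rather than read from a file:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordFilter {
    // Keeps every word whose lowercase form is not in the stop word set
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        return words.stream()
                .filter(word -> !stopWords.contains(word.toLowerCase()))
                .collect(Collectors.toList());
    }
}
```

This version assumes the stop words themselves are stored in lowercase; the original list of words is left unmodified and a new list is returned.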

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin searching within the text, and the last parameter specifies where to end:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences between our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 0)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
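
If both the number of matches and their positions are needed, one alternative is to loop over a Matcher. The sketch below is an illustration rather than a replacement for the Scanner approach; the class name WordFinder is hypothetical, and the \\b word boundaries are an added assumption so that only whole words are counted:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordFinder {
    // Returns the starting index of every whole-word, case-insensitive match
    static List<Integer> positionsOf(String text, String toFind) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(toFind) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        List<Integer> positions = new ArrayList<>();
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }
}
```

The size of the returned list gives the number of occurrences, and each element gives the index at which a match begins.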

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file as long as the next line is not null:

String path = "C:// MobyDick.txt"; 
try { 
    String textLine = ""; 
    int line = 0; // number of lines read so far 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      // The \\b word boundaries restrict the match to the whole word 
      text = text.replaceAll("\\b" + toFind + "\\b", replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me, although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.
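
Using only the standard library, a regular expression word boundary is another way to avoid replacing text inside longer words. The following sketch uses a hypothetical class name, WholeWordReplacer; Pattern.quote and Matcher.quoteReplacement are used so that special characters in either argument are treated literally:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordReplacer {
    // Replaces the word only where it stands alone, not inside other words
    static String replaceWord(String text, String word, String replacement) {
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }
}
```

With this method, replacing me with X leaves Some untouched and still replaces me when it is followed by a period.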

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace all non-word characters, such as hyphens and commas, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespaces from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
// Recent Guava releases expose this matcher as CharMatcher.whitespace() 
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call Ishmael. So years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, it shifts the average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
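
The averaging approach just described can be sketched as follows. This is a minimal illustration under stated assumptions: missing values are marked with Double.NaN rather than zero, so that a genuine zero reading is not mistaken for a gap, and the class name MeanImputer is hypothetical:

```java
import java.util.Arrays;

class MeanImputer {
    // Returns a copy in which every NaN is replaced by the mean of the observed values
    static double[] imputeWithMean(double[] values) {
        double mean = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .average()
                .orElse(Double.NaN); // stays NaN if nothing was observed at all
        return Arrays.stream(values)
                .map(v -> Double.isNaN(v) ? mean : v)
                .toArray();
    }
}
```

Applied to the temperature list with the first month missing, the substituted value would be the mean of the remaining eleven months, roughly 72.18, instead of zero.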

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also declared an Optional variable called tempName, which we use to handle values in the array that may be null. We then loop through our array and call the Optional class's ofNullable method, which wraps each value in an Optional that is empty when the value is null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
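
As an illustration of those other methods, the sketch below chains map and filter before orElse so that blank names are treated the same way as null ones. The class name NameNormalizer and the choice to treat blank strings as missing are assumptions made for this example:

```java
import java.util.Optional;

class NameNormalizer {
    // Returns the trimmed name, or DEFAULT when the name is null or blank
    static String normalize(String name) {
        return Optional.ofNullable(name)
                .map(String::trim)           // skipped when the Optional is empty
                .filter(s -> !s.isEmpty())   // a blank name becomes an empty Optional
                .orElse("DEFAULT");
    }
}
```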

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then create a new TreeSet object to hold the subset retrieved from the list. Next, we print out our original list:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new 
ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In the example that follows, we retrieve all numbers from the first (lowest) number in our set, 12, up to but not including 46:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());       

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4
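
When the upper bound should be included, or only one bound is needed, the NavigableSet interface implemented by TreeSet offers related range views. The following sketch uses the same numbers; the class name RangeViews and its helper method names are hypothetical:

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

class RangeViews {
    static NavigableSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
    }

    // Both bounds inclusive, unlike the two-argument subSet method
    static NavigableSet<Integer> closedRange(NavigableSet<Integer> set, int low, int high) {
        return set.subSet(low, true, high, true);
    }

    // Everything strictly below the given value
    static SortedSet<Integer> below(NavigableSet<Integer> set, int value) {
        return set.headSet(value);
    }

    // The given value and everything above it
    static SortedSet<Integer> atLeast(NavigableSet<Integer> set, int value) {
        return set.tailSet(value);
    }
}
```

For the sample set, closedRange with 12 and 46 yields [12, 14, 34, 44, 46], below with 34 yields [12, 14], and atLeast with 52 yields [52, 87, 123]. These are views backed by the original set, not copies.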

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, this time held in a List called numsList, and we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]
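The skip method is often paired with limit to take a window out of the middle of the data rather than everything after a given position. The sketch below illustrates this under stated assumptions: the class SkipLimitSketch and the method window are hypothetical names, and the input is sorted inside the stream so that the window is taken from the ordered values:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SkipLimitSketch {
    // Returns at most 'count' elements of the sorted input, starting at 'offset'
    static List<Integer> window(List<Integer> values, int offset, int count) {
        return values.stream()
                .sorted()       // order the values first
                .skip(offset)   // drop the lowest 'offset' values
                .limit(count)   // keep no more than 'count' values
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        System.out.println(window(numsList, 5, 2)); // [52, 87]
    }
}
```

If the offset is larger than the number of elements, skip simply produces an empty stream, so the method returns an empty list instead of throwing an exception.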

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
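The same stream pipeline can also drop header lines. The following sketch is one possible approach and not the only one: the class LineFilterSketch, the method dataLines, and the sample text are all assumptions made for illustration. Blank lines are filtered first so that they do not count toward the number of header lines skipped, and a StringReader stands in for the file so that the example is self-contained:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class LineFilterSketch {
    // Removes blank lines, then skips the first 'headerLines' remaining lines
    static List<String> dataLines(BufferedReader reader, int headerLines) {
        return reader.lines()
                .filter(s -> !s.trim().isEmpty()) // also drops whitespace-only lines
                .skip(headerLines)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data: a blank line, a header, and two records
        String sample = "\nname,age\n\nZoey,8\n   \nRoxie,10\n";
        try (BufferedReader br = new BufferedReader(new StringReader(sample))) {
            System.out.println(dataLines(br, 1)); // [Zoey,8, Roxie,10]
        }
    }
}
```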

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now pass this Comparator to the sort method of the Collections class:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both ArrayList instances and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression can be used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dogs object:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
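Chained comparators can also mix ascending and descending keys. The following is a minimal sketch that sorts the oldest dogs first and breaks ties alphabetically by name. The nested Dog class is a hypothetical stand-in for the class assumed in the text, and the names DogSortSketch, sample, and sortedNames are likewise assumptions:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class DogSortSketch {
    // Hypothetical stand-in for the class with name and age described in the text
    static final class Dog {
        private final String name;
        private final int age;
        Dog(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
    }

    // Oldest first; dogs of the same age are ordered alphabetically by name
    static final Comparator<Dog> OLDEST_FIRST =
        Comparator.comparingInt(Dog::getAge).reversed()
                  .thenComparing(Dog::getName);

    static List<Dog> sample() {
        return Arrays.asList(new Dog("Zoey", 8), new Dog("Roxie", 10),
            new Dog("Kylie", 7), new Dog("Shorty", 14),
            new Dog("Ginger", 7), new Dog("Penny", 7));
    }

    // Returns the names in sorted order without modifying the input list
    static List<String> sortedNames(List<Dog> dogs) {
        return dogs.stream()
                .sorted(OLDEST_FIRST)
                .map(Dog::getName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // [Shorty, Roxie, Zoey, Ginger, Kylie, Penny]
        System.out.println(sortedNames(sample()));
    }
}
```

Note that reversed applies to everything built up to that point in the chain, which is why it is called before thenComparing here. Calling it at the end would reverse the name ordering as well.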

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            // parseInt rejects any text that is not a valid integer 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
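The same try-catch technique carries over to other numeric types by swapping the parsing method. The sketch below is one way to generalize it, returning a boolean instead of printing. The class NumberValidationSketch and its methods are hypothetical names, and treating null as invalid is an assumption made here:

```java
import java.util.function.Function;

class NumberValidationSketch {
    // True when the supplied parser accepts the text without throwing
    static boolean parses(String text, Function<String, Number> parser) {
        if (text == null) {
            return false; // assumption: null is never valid
        }
        try {
            parser.apply(text.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    static boolean isInt(String text) {
        return parses(text, Integer::parseInt);
    }

    static boolean isDouble(String text) {
        return parses(text, Double::parseDouble);
    }

    public static void main(String[] args) {
        System.out.println(isInt("1234"));       // true
        System.out.println(isInt("12.5"));       // false
        System.out.println(isDouble("12.5"));    // true
        System.out.println(isDouble("Ishmael")); // false
    }
}
```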

The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
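One way to relax the restriction without giving up validation is to try several acceptable formats in turn. The following sketch uses the java.time API introduced in Java 8 rather than SimpleDateFormat. The class DateValidationSketch and the list of accepted patterns are assumptions chosen for illustration. Strict resolution rejects impossible dates such as February 30, and with the two-letter pattern uu a two-digit year is resolved relative to the year 2000, so 82 is read as 2082:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class DateValidationSketch {
    // Accepted layouts, tried in order (an assumption for this example)
    private static final List<DateTimeFormatter> FORMATS = Arrays.asList(
        strict("MM/dd/uuuu"),
        strict("MM/dd/uu"));

    private static DateTimeFormatter strict(String pattern) {
        return DateTimeFormatter.ofPattern(pattern)
                .withResolverStyle(ResolverStyle.STRICT);
    }

    // Returns the parsed date, or an empty Optional if no format matches
    static Optional<LocalDate> parse(String text) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDate.parse(text, format));
            } catch (DateTimeParseException e) {
                // Fall through and try the next format
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("12/12/1982").isPresent()); // true
        System.out.println(parse("12/12/82").isPresent());   // true
        System.out.println(parse("02/30/1982").isPresent()); // false
        System.out.println(parse("Ishmael").isPresent());    // false
    }
}
```

Returning the parsed LocalDate instead of a message lets the caller decide how to treat an ambiguous two-digit year.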

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many valid variations.

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      // Compile the expression and match it against the entire input 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address

The JavaMail API also provides support for validating e-mail addresses. In this example, we use the InternetAddress class from the javax.mail.internet package to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted according to the requirements of their country or region. For this reason, regular expressions are useful because they can be written to accommodate any postal code format required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
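Because postal code formats differ by country, one possible design keeps a precompiled pattern per country and selects it at validation time. The sketch below assumes hypothetical names (PostalCodeSketch, isValid) and a deliberately simplified Canadian pattern that checks only the letter-digit layout. It is not a complete rule set for either country:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PostalCodeSketch {
    // Patterns are compiled once and reused for every validation
    private static final Map<String, Pattern> PATTERNS = new HashMap<>();
    static {
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        // Simplified assumption: letter digit letter, optional space, digit letter digit
        PATTERNS.put("CA",
            Pattern.compile("^[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]$"));
    }

    // Returns false for unknown countries, null input, or a failed match
    static boolean isValid(String country, String code) {
        Pattern pattern = PATTERNS.get(country);
        return pattern != null && code != null
            && pattern.matcher(code.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("US", "12345-6789")); // true
        System.out.println(isValid("US", "123"));        // false
        System.out.println(isValid("CA", "K1A 0B1"));    // true
    }
}
```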

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s, a period, a comma, an apostrophe, and a hyphen in the character class to allow spaces, periods, commas, apostrophes, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      // The hyphen is placed last so that it is read as a literal character 
      String nameRegex = "^[\\p{L}\\s.,'-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Java core tokenizers

StringTokenizer was the first and most basic tokenizer and has been available since Java 1. Its use is discouraged in new development, where the String class's split method or the java.util.regex package is recommended instead. While StringTokenizer can provide a speed advantage for files with narrowly defined and fixed delimiters, it is less flexible than other tokenizer options. The following is a simple implementation of the StringTokenizer class that splits a string on spaces:

StringTokenizer tokenizer = new StringTokenizer(dirtyText," "); 
while(tokenizer.hasMoreTokens()){ 
  out.print(tokenizer.nextToken() + " "); 
} 

When we set the dirtyText variable to hold our text from Moby Dick, shown previously, we get the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely...

StreamTokenizer is another core Java tokenizer. It provides more information about the tokens retrieved and allows the user to specify data types to parse, but it is considered more difficult to use than StringTokenizer or the split method. The String class's split method is the simplest way to split a string on a delimiter, but it does not provide a way to parse the resulting strings, and it accepts only a single delimiter expression for the entire string. For these reasons, it is not a true tokenizer, but it can be useful for data cleaning.
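To illustrate the extra token information that StreamTokenizer exposes, the following minimal sketch labels each token as a word, a number, or a single character. The class StreamTokenizerSketch and the method describe are hypothetical names, and the sample input is an assumption. It avoids quotes and slashes because StreamTokenizer treats those characters specially by default:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerSketch {
    // Labels every token with the type reported by the tokenizer
    static List<String> describe(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add("WORD:" + st.sval);    // text is held in sval
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add("NUMBER:" + st.nval);  // numbers are parsed as double
                } else {
                    tokens.add("CHAR:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // [WORD:Roxie, WORD:is, NUMBER:10.0, WORD:years, WORD:old]
        System.out.println(describe("Roxie is 10 years old"));
    }
}
```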

The Scanner class is designed to allow you to parse strings into different data types. We used it previously in the Handling CSV data section and we will address it again in the Removing stop words section.

Third-party tokenizers and libraries

Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer. This class provides more advanced support than the standard StringTokenizer class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer:

StrTokenizer tokenizer = new StrTokenizer(text); 
while (tokenizer.hasNext()) { 
  out.print(tokenizer.next() + " "); 
} 

This operates in a similar fashion to StringTokenizer and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.

When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...

We can modify our constructor as follows:

StrTokenizer tokenizer = new StrTokenizer(text,","); 

The output for this implementation is:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse
and nothing particular to interest me on shore
I thought I would sail about a little and see the watery part of the world.

Notice how each line is split where commas existed in the original text. This delimiter can be a simple character or string, as we have shown, or a more complex StrMatcher object.

Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner class and the Splitter class. Tokenization is accomplished in Guava using its Splitter class's split method. The following is a simple example:

Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults(); 
Iterable<String> words = simpleSplit.split(dirtyText);  
for(String token: words){ 
  out.print(token); 
} 

This splits each token on commas and produces output like our last example. We can modify the parameter of the on method to split on the character of our choosing. Notice the method chaining which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.

LingPipe is a linguistic toolkit available for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory interface. We implement a LingPipe IndoEuropeanTokenizerFactory tokenizer in the Removing stop words section.

Transforming data into a usable form

Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyse.

Simple text cleaning

We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:

out.println(dirtyText); 
dirtyText =    dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " "); 
dirtyText = dirtyText.trim(); 
while(dirtyText.contains("  ")){ 
  dirtyText = dirtyText.replaceAll("  ", " "); 
} 
out.println(dirtyText);  

When executed, the code produces the following output, truncated:

Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely

Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W, which represents anything that is not a word character:

out.println(dirtyText); 
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); 
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); 
for(String clean : cleanText){ 
  out.print(clean + " ");
} 

This code produces the same output as shown previously.

Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:

out.println(dirtyText); 
String[] words =    dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = String.join(" ", words); 
out.println(cleanText); 

Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:

out.println(dirtyText);  
String[] words =    dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = Joiner.on(" ").skipNulls().join(words); 
out.println(cleanText); 

This version provides additional options, including skipping nulls, as shown before. The output remains the same.
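The split and join steps can also be combined into a single stream pipeline, which makes it easy to discard empty strings produced when the text begins with a delimiter. This is a minimal sketch using only the standard library. The class CleanJoinSketch and the method clean are hypothetical names:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class CleanJoinSketch {
    // Lowercases the text, splits on non-word characters and digits,
    // drops empty fragments, and rejoins the words with single spaces
    static String clean(String dirtyText) {
        return Arrays.stream(dirtyText.toLowerCase().split("[\\W\\d]+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String dirtyText = "  Call me Ishmael. Some years ago- never mind ";
        // call me ishmael some years ago never mind
        System.out.println(clean(dirtyText));
    }
}
```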

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform, split it into words, and store the words in an ArrayList using the Arrays class's asList method. We will assume here that the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
// Split the cleaned text on spaces so that each word becomes a list element 
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" "))); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList<>(); 
while(readStop.hasNextLine()){ 
  // Normalize each stop word so that it matches the formatting of our text 
  String stopWord = readStop.nextLine().toLowerCase().trim(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
readStop.close(); 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
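When the stop word list is large, holding it in a HashSet avoids scanning the whole word list once per stop word. The sketch below shows this variation under stated assumptions: the class StopWordSketch, the method removeStopWords, and the short stop word list are hypothetical, and the stop words are assumed to be lowercase already:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordSketch {
    // Keeps every word whose lowercase form is not in the stop word set
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        return words.stream()
                .filter(word -> !stopWords.contains(word.toLowerCase()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical stop word list, assumed to be lowercase
        Set<String> stopWords = new HashSet<>(Arrays.asList("me", "some", "and"));
        List<String> words =
            Arrays.asList("call", "me", "ishmael", "some", "years", "ago");
        System.out.println(removeStopWords(words, stopWords)); // [call, ishmael, years, ago]
    }
}
```

Unlike removeAll, this version leaves the original list untouched and returns a new one.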

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we obtain an instance of the TokenizerFactory interface. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin tokenizing within the array, and the last parameter specifies how many characters to tokenize:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences between this output and that of our previous example. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.
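Splitting on spaces misses a word that is followed directly by punctuation, such as a comma. An alternative is to count matches of a regular expression with word boundaries. The following is a minimal sketch. The class WordCountSketch and the method countWord are hypothetical names, and it assumes the search term begins and ends with a word character so that the \b anchors apply:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCountSketch {
    // Counts whole-word, case-insensitive occurrences of 'word' in 'text'
    static int countWord(String text, String word) {
        Pattern pattern = Pattern.compile(
            "\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. On shore, I thought I would sail";
        System.out.println(countWord(text, "i")); // 2
    }
}
```

Pattern.quote ensures that any regular expression metacharacters in the search term are matched literally.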

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 0)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
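The effect of the horizon argument can be seen by searching the same text with different limits. In this minimal sketch, the class HorizonSketch and the method find are hypothetical names. The findWithinHorizon method returns null when no match lies within the horizon, and a horizon of zero removes the limit:

```java
import java.util.Scanner;

class HorizonSketch {
    // Returns the first match of 'regex' within 'horizon' characters, or null
    static String find(String text, String regex, int horizon) {
        try (Scanner scanner = new Scanner(text)) {
            return scanner.findWithinHorizon(regex, horizon);
        }
    }

    public static void main(String[] args) {
        String text = "call me ishmael";
        System.out.println(find(text, "ishmael", 0)); // ishmael
        System.out.println(find(text, "ishmael", 5)); // null
    }
}
```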

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process the file line by line until the readLine method returns null, which signals the end of the file:

String path = "C://MobyDick.txt"; 
try { 
    String textLine = ""; 
    int line = 0;  // number of lines read so far 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
        } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...
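
The same loop can also report where each match occurred. The following is a minimal sketch that returns the line numbers of matching lines and uses try-with-resources so the reader is always closed. The names LineSearch and findLines are hypothetical, and a StringReader stands in for the file so the example is self-contained:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineSearch {
    // Returns the 1-based numbers of the lines that contain toFind.
    static List<Integer> findLines(Reader source, String toFind) {
        List<Integer> hits = new ArrayList<>();
        String needle = toFind.toLowerCase().trim();
        // try-with-resources closes the reader even if readLine throws.
        try (BufferedReader reader = new BufferedReader(source)) {
            String textLine;
            int lineNumber = 0;
            while ((textLine = reader.readLine()) != null) {
                lineNumber++;
                if (textLine.toLowerCase().contains(needle)) {
                    hits.add(lineNumber);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "Call me Ishmael.\nSome years ago\nnever mind how long";
        System.out.println(findLines(new StringReader(sample), "i"));
    }
}
```

To search a real file, a FileReader for the path can be passed in place of the StringReader.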

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
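      // replaceAll takes a regular expression; \b limits matches to whole words 
      text = text.replaceAll("\\b" + Pattern.quote(toFind) + "\\b", replaceWith); 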
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me, although this will miss any instance where me is followed by punctuation, such as at the end of a sentence. We will examine a better alternative using Google Guava in a moment.
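
Another option, which is not part of the original example, is to match on word boundaries with the standard String.replaceAll method. The following is a minimal sketch; WordReplacer and replaceWord are hypothetical names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordReplacer {
    // Replaces word only where it appears as a whole word.
    static String replaceWord(String text, String word, String replacement) {
        // Pattern.quote and Matcher.quoteReplacement keep special characters
        // in either argument from being read as regex syntax.
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(
                replaceWord("Call me Ishmael. Some years ago, trust me.", "me", "X"));
    }
}
```

Here the me inside Some is left alone, while the me before the final period is still replaced.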

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by a whitespace character, such as the period and hyphens in our text, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
// Recent Guava releases expose this matcher through the whitespace() factory method 
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call   Ishmael. So  years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is being conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, it is not: a single substituted zero lowers the average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
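
The averaging approach just described can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it assumes missing values are marked with Double.NaN instead of zero, and the names MeanImputer and imputeWithMean are hypothetical:

```java
import java.util.Arrays;

class MeanImputer {
    // Returns a copy of values in which every NaN is replaced by the
    // mean of the entries that are present.
    static double[] imputeWithMean(double[] values) {
        double mean = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .average()
                .orElse(Double.NaN); // nothing to average if every value is missing
        return Arrays.stream(values)
                .map(v -> Double.isNaN(v) ? mean : v)
                .toArray();
    }

    public static void main(String[] args) {
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList);
        System.out.printf("The average temperature is %1$,.2f%n",
                Arrays.stream(filled).average().orElse(Double.NaN));
    }
}
```

Using NaN as the marker avoids the ambiguity raised above, where zero might be a legitimate temperature.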

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also declared an Optional<String> variable called tempName, which we use to handle array values that may be null. We then loop through our array and call the ofNullable method of the Optional class. This method wraps the value in an Optional, returning an empty Optional when the value is null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
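
As an illustration of those other methods, the following minimal sketch chains map and filter before orElse, so that blank names are treated the same way as null ones. The names NameCleaner and cleanName are hypothetical:

```java
import java.util.Optional;

class NameCleaner {
    // Returns the trimmed name, or DEFAULT when the name is null or blank.
    static String cleanName(String name) {
        return Optional.ofNullable(name)
                .map(String::trim)           // skipped when the Optional is empty
                .filter(s -> !s.isEmpty())   // turns a blank name into an empty Optional
                .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", "  Bob ", "", null, "Betsy"};
        for (String name : nameList) {
            System.out.println("Name to use = " + cleanName(name));
        }
    }
}
```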

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet variable to hold the subset retrieved from the list. Next, we print out our original list:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new 
ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is excluded. In our example that follows, we want to retrieve all numbers from the smallest number in our set, 12, up to but not including 46:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());       

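Our output follows. The trailing 123 on the first line is the value returned by last, and the trailing 4 on the second line is the size of the subset:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4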
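
When the upper bound should be included as well, the NavigableSet interface implemented by TreeSet offers an overload of subSet with explicit inclusive flags. The following is a minimal sketch; InclusiveSubset and between are hypothetical names:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.NavigableSet;
import java.util.TreeSet;

class InclusiveSubset {
    // Returns the sorted values from low to high, with both ends included.
    static NavigableSet<Integer> between(Collection<Integer> values, int low, int high) {
        TreeSet<Integer> sorted = new TreeSet<>(values);
        // The two boolean flags control whether each bound is inclusive.
        return sorted.subSet(low, true, high, true);
    }

    public static void main(String[] args) {
        System.out.println("SubSet of List: "
                + between(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44), 12, 46));
    }
}
```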

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, held in a list called numsList, but this time we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements. The toCollection call assumes a static import of java.util.stream.Collectors.toCollection:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
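
Header lines can be dropped in the same pipeline. The following is a minimal sketch that assumes the header occupies a known number of lines at the top of the data; LineFilter and dataLines are hypothetical names, and a StringReader stands in for the file:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class LineFilter {
    // Drops the first headerLines lines, then any empty or whitespace-only lines.
    static List<String> dataLines(BufferedReader reader, int headerLines) {
        return reader.lines()
                .skip(headerLines)
                .filter(s -> !s.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String sample = "name,age\n\nZoey,8\n   \nRoxie,10\n";
        BufferedReader br = new BufferedReader(new StringReader(sample));
        dataLines(br, 1).forEach(System.out::println);
    }
}
```

Unlike the equals("") test, trimming first also removes lines that contain only spaces.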

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now pass this comparator to the sort method of the Collections class:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both lists and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression is used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dogs object:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7
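
The Dogs class itself is not shown in the text. A minimal version that would satisfy the calls used in this example might look like the following sketch; the field names and the choice to make the class immutable are assumptions:

```java
public class Dogs {
    private final String name;
    private final int age;

    public Dogs(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Accessors referenced by the sorting code as Dogs::getName and Dogs::getAge.
    public String getName() {
        return name;
    }

    public int getAge() {
        return age;
    }

    public static void main(String[] args) {
        Dogs d = new Dogs("Zoey", 8);
        System.out.println(d.getName() + " " + d.getAge());
    }
}
```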

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

      dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

   dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            // parseInt rejects any text that is not a valid int value 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
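
As noted above, the same technique carries over to other numeric types. The following is a minimal sketch for Double that returns a boolean instead of printing, which makes the check easier to reuse. NumberValidator and isValidDouble are hypothetical names:

```java
class NumberValidator {
    // Returns true if toValidate can be parsed as a double.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false; // parseDouble throws NullPointerException for null input
        }
        try {
            Double.parseDouble(toValidate.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("12.34 valid? " + isValidDouble("12.34"));
        System.out.println("Ishmael valid? " + isValidDouble("Ishmael"));
    }
}
```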

The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date, for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we format the parsed Date back into a String and compare it with the original text. The date is considered valid only if the two match, which rejects input that parses but does not follow the required format:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
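
The java.time API introduced in Java 8 offers another way to enforce a format strictly, without the round-trip comparison. The following is a minimal sketch rather than the chapter's method; StrictDateValidator and isValidDate are hypothetical names, and the MM/dd/uuuu pattern mirrors the format used above:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class StrictDateValidator {
    // "uuuu" (proleptic year) is used instead of "yyyy" (year-of-era),
    // because strict resolution of year-of-era also requires an era.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("MM/dd/uuuu")
                    .withResolverStyle(ResolverStyle.STRICT);

    // Returns true only for real calendar dates written as MM/dd/uuuu.
    static boolean isValidDate(String theDate) {
        try {
            LocalDate.parse(theDate, FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"12/12/1982", "12/12/82", "02/30/1982", "Ishmael"}) {
            System.out.println(s + (isValidDate(s) ? " is" : " is not") + " a valid date");
        }
    }
}
```

Strict resolution also rejects impossible dates such as February 30, rather than adjusting them to a nearby valid date.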

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations, and addresses that look unusual can still be valid.

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address

The JavaMail API, in the javax.mail.internet package, can validate e-mail addresses as well. In this example, we use its InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also use \\s-',. to allow spaces, hyphens, apostrophes, commas, and periods in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s-',.]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Third-party tokenizers and libraries

Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer. This class provides more advanced support than the standard StringTokenizer class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer:

StrTokenizer tokenizer = new StrTokenizer(text); 
while (tokenizer.hasNext()) { 
  out.print(tokenizer.next() + " "); 
} 

This operates in a similar fashion to StringTokenizer and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.

When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...

We can modify our constructor as follows:

StrTokenizer tokenizer = new StrTokenizer(text,","); 

The output for this implementation is:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse
and nothing particular to interest me on shore
I thought I would sail about a little and see the watery part of the world.

Notice how each line is split where commas existed in the original text. This delimiter can be a simple char or String, as we have shown, or a more complex StrMatcher object.

Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner class and the Splitter class. Tokenization is accomplished in Guava using its Splitter class's split method. The following is a simple example:

Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults(); 
Iterable<String> words = simpleSplit.split(dirtyText);  
for(String token: words){ 
  out.print(token); 
} 

This splits each token on commas and produces output like our last example. We can modify the parameter of the on method to split on the character of our choosing. Notice the method chaining which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.
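
For comparison, similar behavior can be approximated with the standard library alone. The following is a minimal sketch, not a replacement for Splitter; SimpleSplitter and splitAndTrim are hypothetical names:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SimpleSplitter {
    // Splits on a literal delimiter, trims each token, and drops empty tokens.
    static List<String> splitAndTrim(String text, String delimiter) {
        // String.split expects a regular expression, so the delimiter is quoted.
        return Arrays.stream(text.split(Pattern.quote(delimiter)))
                .map(String::trim)
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "in my purse, and nothing particular,, I thought ";
        for (String token : splitAndTrim(text, ",")) {
            System.out.println(token);
        }
    }
}
```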

LingPipe is a linguistic toolkit available for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory interface. We implement a LingPipe IndoEuropeanTokenizerFactory tokenizer in the Simple text cleaning section.

Transforming data into a usable form

Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyse.

Simple text cleaning

We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:

out.println(dirtyText); 
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " "); 
dirtyText = dirtyText.trim(); 
while(dirtyText.contains("  ")){ 
  dirtyText = dirtyText.replaceAll("  ", " "); 
} 
out.println(dirtyText);  

When executed, the code produces the following output, truncated:

Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely

Our next example produces the same result but uses the split method to separate the words. In this case, we remove all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression [\\W]+, which matches one or more characters that are not word characters:

out.println(dirtyText); 
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); 
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); 
for(String clean : cleanText){ 
  out.print(clean + " ");
} 

This code produces the same output as shown previously.
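
The cleaning steps described above can also be gathered into one reusable method. The following is only a minimal sketch: the class name TextCleaner and the method name clean are our own, and the final replaceAll call with \\s+ stands in for the while loop used earlier:

```java
class TextCleaner {
    // Lowercases the text, replaces digits and punctuation with spaces,
    // trims the ends, and collapses runs of whitespace into one space.
    static String clean(String text) {
        return text.toLowerCase()
                .replaceAll("[\\d[^\\w\\s]]+", " ")
                .trim()
                .replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String dirtyText =
            "Call me Ishmael. Some years ago- never mind how long precisely -";
        System.out.println(clean(dirtyText));
    }
}
```

Because \\w treats the underscore as a word character, underscores survive this cleaning step. Whether that is acceptable depends on the dataset.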

Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:

out.println(dirtyText); 
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = String.join(" ", words); 
out.println(cleanText); 

Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:

out.println(dirtyText);  
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = Joiner.on(" ").skipNulls().join(words); 
out.println(cleanText); 

This version provides additional options, including skipping nulls, as shown before. The output remains the same.

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform and store its words in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" "))); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList<>(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase().trim(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
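
When the stop words are already in memory, the same filtering can be expressed without a file. The following is a minimal sketch under that assumption: the class name StopWordFilter, the method removeStopWords, and the tiny stop word set are all hypothetical and only illustrate the idea:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFilter {
    // Returns a copy of words with every stop word removed.
    // Each word is normalized before the lookup so casing does not matter.
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        List<String> kept = new ArrayList<>(words);
        kept.removeIf(word -> stopWords.contains(word.toLowerCase().trim()));
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative stop words only; a real list would be much longer.
        Set<String> stopWords = new HashSet<>(Arrays.asList("me", "some", "or"));
        List<String> words = Arrays.asList("call", "me", "ishmael", "some", "years");
        System.out.println(removeStopWords(words, stopWords));
    }
}
```

A HashSet gives constant-time lookups, which matters more as the text and the stop word list grow.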

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we obtain an instance of the TokenizerFactory interface. We wrap our factory so that it uses default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin within the text, and the last parameter specifies how many characters to process:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences from our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
      out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.
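
Splitting on a single space leaves punctuation attached to words, so a token such as i, would not be counted. A possible variation, shown here as a sketch with the hypothetical names WordCounter and countWord, splits on runs of non-word characters instead:

```java
class WordCounter {
    // Counts whole-word, case-insensitive occurrences of toFind in text.
    static int countWord(String text, String toFind) {
        String target = toFind.toLowerCase().trim();
        int count = 0;
        // Splitting on \\W+ strips punctuation that clings to words
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "I thought I would sail, I think.";
        System.out.println("Found i " + countWord(text, "I") + " times in the text.");
    }
}
```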

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 0)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
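
If a count is needed, findWithinHorizon can be called repeatedly, since each successful call advances the Scanner past the match. The sketch below assumes the search term should be treated as literal text rather than as a regular expression; the names HorizonSearch and countMatches are ours:

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class HorizonSearch {
    // Counts non-overlapping occurrences of toFind anywhere in text,
    // including occurrences inside longer words.
    static int countMatches(String text, String toFind) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            // Pattern.quote makes the term literal; a horizon of 0 means unbounded
            while (scanner.findWithinHorizon(Pattern.quote(toFind), 0) != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("call me ishmael. some years", "me"));
    }
}
```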

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file until the readLine method returns null, which signals the end of the file:

String path = "C://MobyDick.txt"; 
try { 
    String textLine = ""; 
    int line = 0; // counts the lines read so far 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
        } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...
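
A line counter makes it possible to report where each match occurred. The following sketch reads from any Reader so that it can be tried without a file on disk; the class LineSearch and the method findLineNumbers are hypothetical names, and a try-with-resources statement replaces the explicit close call:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineSearch {
    // Returns the 1-based numbers of the lines that contain toFind.
    static List<Integer> findLineNumbers(Reader source, String toFind) {
        List<Integer> hits = new ArrayList<>();
        String target = toFind.toLowerCase().trim();
        try (BufferedReader reader = new BufferedReader(source)) {
            String textLine;
            int line = 0;
            while ((textLine = reader.readLine()) != null) {
                line++;
                if (textLine.toLowerCase().contains(target)) {
                    hits.add(line);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "Call me Ishmael.\nSome years ago\nnever mind how long";
        System.out.println(findLineNumbers(new StringReader(sample), "i"));
    }
}
```

To search a real file, a FileReader for the desired path could be passed in place of the StringReader.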

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      // \\b marks a word boundary, so only the whole word is replaced 
      text = text.replaceAll("\\b" + toFind + "\\b", replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me, although this will miss any instance where me is followed by punctuation, such as at the end of a sentence. We will examine a better alternative using Google Guava in a moment.
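
Independently of any library, a word-boundary regular expression is one way to replace only whole words using the standard String class. This is a minimal sketch with hypothetical names (WholeWordReplace, replaceWord), not a substitute for the library approaches discussed in this section:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordReplace {
    // Replaces word only where it stands alone, not inside longer words.
    static String replaceWord(String text, String word, String replacement) {
        // \\b matches the boundary between a word and a non-word character;
        // quoting keeps special characters in either argument literal
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago, tell me.";
        System.out.println(replaceWord(text, "me", "X"));
    }
}
```

Here Some is left untouched, while me followed by a period is still replaced.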

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by a whitespace character, such as the period after Ishmael or the hyphen after ago, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
// Recent Guava releases expose this matcher as CharMatcher.whitespace() 
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call Ishmael. So years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, a single substituted zero lowers the average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
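
The mean substitution just described can be sketched as follows. We assume here that missing readings are marked with Double.NaN rather than zero, precisely because zero may be a legitimate value; the names MeanImputer and imputeWithMean are hypothetical:

```java
import java.util.Arrays;

class MeanImputer {
    // Returns a copy of values in which every NaN (our assumed marker for
    // missing data) is replaced by the mean of the values that are present.
    static double[] imputeWithMean(double[] values) {
        double sum = 0;
        int known = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                known++;
            }
        }
        double[] result = Arrays.copyOf(values, values.length);
        if (known == 0) {
            return result; // nothing to base an estimate on
        }
        double mean = sum / known;
        for (int i = 0; i < result.length; i++) {
            if (Double.isNaN(result[i])) {
                result[i] = mean;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        System.out.printf("Imputed value: %.2f%n", imputeWithMean(tempList)[0]);
    }
}
```

With the first month missing, the remaining eleven temperatures average roughly 72.18, which is far closer to the original figure than the result obtained by substituting zero.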

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also declared an Optional variable called tempName. We will use this to handle array values that may be null. We then loop through our array and call the Optional class's ofNullable method, which wraps each value in an Optional that is empty when the value is null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
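
As one illustration of those other methods, map and filter can be chained before orElse. In this sketch we additionally treat blank names as missing, which is our own assumption rather than part of the earlier example; NameDefaults and usableName are hypothetical names:

```java
import java.util.Optional;

class NameDefaults {
    // Returns the trimmed name, or DEFAULT when the name is null or blank.
    static String usableName(String name) {
        return Optional.ofNullable(name)
                .map(String::trim)            // skipped when the value is null
                .filter(s -> !s.isEmpty())    // an empty string becomes "no value"
                .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", " Bob ", "", null, "Betsy"};
        for (String name : nameList) {
            System.out.println("Name to use = " + usableName(name));
        }
    }
}
```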

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then create a new TreeSet object to hold the subset retrieved from the list. Next, we print out our original list:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new 
ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve a subset of all numbers from the first number in our set, 12, up to but not including 46:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());       

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4
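
TreeSet offers a few related range methods as well. The sketch below shows the four-argument subSet overload, which lets us include the upper bound, along with headSet and tailSet; the class SubsetViews and its helper methods are hypothetical wrappers for illustration:

```java
import java.util.Arrays;
import java.util.TreeSet;

class SubsetViews {
    static TreeSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
    }

    // Both bounds are included in the result
    static String inclusiveRange(int from, int to) {
        return sample().subSet(from, true, to, true).toString();
    }

    // Elements strictly less than limit
    static String below(int limit) {
        return sample().headSet(limit).toString();
    }

    // Elements greater than or equal to limit
    static String atOrAbove(int limit) {
        return sample().tailSet(limit).toString();
    }

    public static void main(String[] args) {
        System.out.println(inclusiveRange(12, 46));
        System.out.println(below(46));
        System.out.println(atOrAbove(46));
    }
}
```

These methods return views backed by the original set, so later changes to the set within the range show up in the view.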

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, this time held in a list named numsList, and we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements. The toCollection method is assumed to be statically imported from the Collectors class:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]
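
Adding the limit method to the chain selects a window from the middle of the sorted data rather than everything after the skipped elements. This is a small sketch; StreamWindow and window are hypothetical names:

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

class StreamWindow {
    // Sorts nums, skips the lowest `skip` values, and keeps at most `size` values.
    static TreeSet<Integer> window(List<Integer> nums, long skip, long size) {
        return new TreeSet<>(nums)
                .stream()
                .skip(skip)
                .limit(size)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        System.out.println("Window of List: " + window(numsList, 2, 3));
    }
}
```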

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
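
Header lines can be dropped in the same pipeline by calling skip before filter. The sketch below reads from a String so that it runs without a file; the names LineFilter and dataLines are hypothetical, and treating whitespace-only lines as blank is our own choice:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

class LineFilter {
    // Skips the first headerLines lines, then drops blank lines.
    static List<String> dataLines(String content, int headerLines) {
        try (BufferedReader br = new BufferedReader(new StringReader(content))) {
            return br.lines()
                    .skip(headerLines)
                    .filter(s -> !s.trim().isEmpty())
                    .collect(Collectors.toList());
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        String content = "name,age\n\nZoey,8\n   \nRoxie,10\n";
        dataLines(content, 1).forEach(System.out::println);
    }
}
```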

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now call the sort method of the Collections class, passing our comparator as the second argument:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List instances and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression is expected:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
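
For naturally ordered data such as integers, the Comparator.reverseOrder method gives the same descending order without building a comparator first. A brief sketch, with the hypothetical names DescendingSort and descending:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DescendingSort {
    // Returns a new list sorted from largest to smallest; the input is untouched.
    static List<Integer> descending(List<Integer> nums) {
        List<Integer> copy = new ArrayList<>(nums);
        copy.sort(Comparator.reverseOrder());
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        System.out.println("Descending Integer List: " + descending(numsList));
    }
}
```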

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each dog:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
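
Chained comparators can also mix directions. The following self-contained sketch sorts the oldest dogs first and breaks ties alphabetically. The nested Dog class is a hypothetical stand-in for the class assumed in the text, and DogSortDemo, sampleDogs, and namesOldestFirst are our own names:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class DogSortDemo {
    // Hypothetical stand-in for the class assumed in the text
    static class Dog {
        private final String name;
        private final int age;
        Dog(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
    }

    static List<Dog> sampleDogs() {
        List<Dog> dogs = new ArrayList<>();
        dogs.add(new Dog("Zoey", 8));
        dogs.add(new Dog("Roxie", 10));
        dogs.add(new Dog("Kylie", 7));
        dogs.add(new Dog("Shorty", 14));
        dogs.add(new Dog("Ginger", 7));
        dogs.add(new Dog("Penny", 7));
        return dogs;
    }

    // Sorts by age descending, then by name ascending, and returns the names.
    static List<String> namesOldestFirst(List<Dog> dogs) {
        Comparator<Dog> byAge = Comparator.comparingInt(Dog::getAge);
        Comparator<Dog> oldestFirst = byAge.reversed().thenComparing(Dog::getName);
        return dogs.stream()
                .sorted(oldestFirst)
                .map(Dog::getName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(namesOldestFirst(sampleDogs()));
    }
}
```

Calling reversed before thenComparing reverses only the age comparison, so the names within each age group remain in ascending order.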

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
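
The same pattern carries over to floating point data. Here is a minimal sketch that returns a boolean instead of printing; the names NumberValidation and isValidDouble are hypothetical, and the explicit null check is needed because parseDouble throws a NullPointerException rather than a NumberFormatException when given null:

```java
class NumberValidation {
    // Returns true when the text can be parsed as a double.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false;
        }
        try {
            Double.parseDouble(toValidate.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("3.14 valid? " + isValidDouble("3.14"));
        System.out.println("Ishmael valid? " + isValidDouble("Ishmael"));
    }
}
```

Note that parseDouble also accepts forms such as 1e3 and NaN, which may or may not be desirable for a given dataset.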

The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we format the parsed date back into a String and compare it with the original input to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
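
The Java 8 java.time package offers another way to enforce a format strictly. The sketch below is an alternative to the SimpleDateFormat approach, not the method used above. It assumes the pattern MM/dd/uuuu, where uuuu is the java.time year field that works with strict resolution, and the names StrictDateValidation and isValidDate are hypothetical:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class StrictDateValidation {
    // Returns true when theDate matches the pattern and is a real calendar date.
    static boolean isValidDate(String theDate, String pattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern)
                .withResolverStyle(ResolverStyle.STRICT);
        try {
            LocalDate.parse(theDate, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String dateFormat = "MM/dd/uuuu";
        System.out.println(isValidDate("12/12/1982", dateFormat));
        System.out.println(isValidDate("12/12/82", dateFormat));
        System.out.println(isValidDate("02/30/1982", dateFormat));
    }
}
```

Strict resolution also rejects impossible dates such as February 30, so no round-trip comparison is needed.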

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many valid variations.

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address

The JavaMail API also provides support for validating e-mail addresses. It is not part of the core JDK, so the javax.mail library must be on the classpath. In this example, we use the InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
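If local addresses should be rejected, one possible approach is to add a second check on the domain part. The sketch below uses only JDK string methods; the class name DomainCheck and the method hasDottedDomain are hypothetical, and the rule it applies (the domain must contain a dot that is neither its first nor its last character) is a simplifying assumption rather than a complete validity test.

```java
class DomainCheck {
    // Returns true only when the text after the last @ looks like name.tld
    static boolean hasDottedDomain(String email) {
        int at = email.lastIndexOf('@');
        if (at <= 0 || at == email.length() - 1) {
            return false; // no @, nothing before it, or nothing after it
        }
        String domain = email.substring(at + 1);
        int dot = domain.indexOf('.');
        return dot > 0 && !domain.endsWith(".");
    }

    public static void main(String[] args) {
        System.out.println(hasDottedDomain("myEmail@mail"));
        System.out.println(hasDottedDomain("user@example.com"));
    }
}
```

A check like this would be combined with, not substituted for, one of the validators shown in this section.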

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
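Because each country needs its own pattern, it can help to keep the patterns in a lookup table. The following sketch assumes a hypothetical PostalCodeValidator class keyed by two-letter country codes. The US pattern is the one used above, while the Canadian pattern is a deliberately simplified assumption (letter, digit, letter, optional space, digit, letter, digit) that does not capture every rule of the real format.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PostalCodeValidator {
    private static final Map<String, Pattern> PATTERNS = new HashMap<>();
    static {
        // US pattern taken from the validateZip example
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        // Simplified, illustrative pattern for Canadian postal codes
        PATTERNS.put("CA", Pattern.compile("^[A-Z][0-9][A-Z] ?[0-9][A-Z][0-9]$"));
    }

    // Unknown country codes are treated as invalid rather than guessed at
    static boolean isValid(String countryCode, String postalCode) {
        Pattern pattern = PATTERNS.get(countryCode);
        return pattern != null && pattern.matcher(postalCode).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("US", "12345-6789"));
        System.out.println(isValid("CA", "K1A 0B1"));
    }
}
```

Compiling each pattern once and reusing it also avoids recompiling the regular expression on every call.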

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \\p{L} provides this flexibility. We also add \\s, the period, the apostrophe, the comma, and the hyphen to the character class to allow spaces, periods, apostrophes, commas, and hyphens in our name fields. The hyphen is placed last so that it is treated as a literal character. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s.',-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Transforming data into a usable form

Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyse.

Simple text cleaning

We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:

out.println(dirtyText); 
dirtyText =    dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " "); 
dirtyText = dirtyText.trim(); 
while(dirtyText.contains("  ")){ 
  dirtyText = dirtyText.replaceAll("  ", " "); 
} 
out.println(dirtyText);  

When executed, the code produces the following output, truncated:

Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely

Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W, which represents anything that is not a word character:

out.println(dirtyText); 
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); 
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); 
for(String clean : cleanText){ 
  out.print(clean + " ");
} 

This code produces the same output as shown previously.

Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:

out.println(dirtyText); 
String[] words =    dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = String.join(" ", words); 
out.println(cleanText); 

Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:

out.println(dirtyText);  
String[] words =    dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = Joiner.on(" ").skipNulls().join(words); 
out.println(cleanText); 

This version provides additional options, including skipping nulls, as shown before. The output remains the same.
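One detail worth guarding against is that split returns an empty first element when the text begins with a delimiter, for example a leading hyphen. The following sketch shows one way to handle this with Java 8 streams; the TextCleaner class and its clean method are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class TextCleaner {
    // Lowercases the text, splits on non-word characters and digits,
    // drops empty tokens, and joins the remaining words with single spaces
    static String clean(String dirtyText) {
        return Arrays.stream(dirtyText.toLowerCase().split("[\\W\\d]+"))
            .filter(word -> !word.isEmpty())
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(clean("- Call me Ishmael. Some years ago- 42 never mind"));
    }
}
```

Because empty tokens are filtered out, no separate call to trim is needed in this version.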

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" "))); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList<>(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase().trim(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
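When the list of stop words is long, it may be more convenient to hold it in a Set and filter the text in a single pass. The following sketch assumes the stop words have already been loaded and lowercased; the StopWordFilter class and removeStopWords method are hypothetical names, not part of the example above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordFilter {
    // Keeps every word whose lowercase form is not in the stop word set
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        return words.stream()
            .filter(word -> !stopWords.contains(word.toLowerCase()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("me", "some", "and"));
        List<String> words = Arrays.asList("call", "me", "ishmael", "some", "years");
        System.out.println(removeStopWords(words, stopWords));
    }
}
```

Unlike the removeAll approach, this version leaves the original list untouched and returns a new one.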

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin within the array, and the last parameter specifies how many characters to tokenize:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences between our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
      out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.
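Splitting on a single space misses words that are followed by punctuation, such as the I in "I, too". One alternative, sketched here under the assumption that whole-word matches are wanted, is to count matches of a regular expression with word boundaries. The WordCounter class and countWord method are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCounter {
    // Counts whole-word, case-insensitive occurrences of toFind in text
    static int countWord(String text, String toFind) {
        // Pattern.quote treats toFind as literal text, not as a regex
        Pattern pattern = Pattern.compile(
            "\\b" + Pattern.quote(toFind) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. On shore, I thought I would sail.";
        System.out.println("Found i " + countWord(text, "i") + " times in the text.");
    }
}
```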

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 0)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file until the readLine method returns null, which signals the end of the file:

String path = "C://MobyDick.txt"; 
try { 
    String textLine = ""; 
    int line = 0; // number of lines read so far 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...
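To also report where each match occurs, the line counter can be collected as the file is read. The following sketch takes any Reader so that it can be tried without a file on disk; the LineSearch class and findLines method are hypothetical names, and wrapping the IOException in an UncheckedIOException is simply one possible way to handle it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineSearch {
    // Returns the 1-based numbers of all lines that contain toFind
    static List<Integer> findLines(Reader source, String toFind) {
        List<Integer> matches = new ArrayList<>();
        String needle = toFind.toLowerCase().trim();
        // try-with-resources closes the reader even if an exception occurs
        try (BufferedReader reader = new BufferedReader(source)) {
            String textLine;
            int line = 0;
            while ((textLine = reader.readLine()) != null) {
                line++;
                if (textLine.toLowerCase().contains(needle)) {
                    matches.add(line);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return matches;
    }

    public static void main(String[] args) {
        Reader text = new StringReader("Call me Ishmael.\nSome years ago\nI thought");
        System.out.println("Found on lines " + findLines(text, "i"));
    }
}
```

To search a real file, a FileReader for the path can be passed in place of the StringReader.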

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
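if(text.contains(toFind)){ 
      // Word-boundary anchors restrict the match to whole words, 
      // so the i inside ishmael is left untouched 
      text = text.replaceAll("\\b" + toFind + "\\b", replaceWith); 
      out.println(text); 
} 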

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me , although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.
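With the standard library alone, one way to replace only whole words is a regular expression with word boundaries. This is a minimal sketch; the WholeWordReplace class and replaceWord method are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordReplace {
    // Replaces toFind only where it appears as a complete word
    static String replaceWord(String text, String toFind, String replaceWith) {
        // quote and quoteReplacement keep both arguments literal
        String regex = "\\b" + Pattern.quote(toFind) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replaceWith));
    }

    public static void main(String[] args) {
        System.out.println(replaceWord("Call me Ishmael. Some years ago", "me", "X"));
    }
}
```

Because a word boundary also occurs before punctuation, an instance of me at the end of a sentence is replaced as well, while the me inside Some is not.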

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by a whitespace character, such as the period and hyphens in our text, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
// Recent Guava releases use the whitespace() factory method in place of the WHITESPACE constant 
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call Ishmael. So years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.

Statistical analysts differ on how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that although only one value changed, the average dropped by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
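The averaging approach just described can be sketched as follows. We assume here that missing readings are marked with Double.NaN rather than zero, so that a genuine zero is never mistaken for a gap; the MeanImputer class and imputeWithMean method are hypothetical names.

```java
import java.util.Arrays;

class MeanImputer {
    // Replaces every NaN entry with the mean of the entries that are present
    static double[] imputeWithMean(double[] values) {
        double mean = Arrays.stream(values)
            .filter(v -> !Double.isNaN(v))
            .average()
            .orElse(Double.NaN); // all values missing: nothing to impute from
        return Arrays.stream(values)
            .map(v -> Double.isNaN(v) ? mean : v)
            .toArray();
    }

    public static void main(String[] args) {
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] imputed = imputeWithMean(tempList);
        System.out.printf("Imputed first value: %.2f%n", imputed[0]);
    }
}
```

For the temperature data used above, the missing first month is filled with the mean of the remaining eleven values, 794/11, or roughly 72.18.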

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also declared an Optional variable called tempName. We will use this to handle values in the array that may be null. We then loop through our array and call the Optional class ofNullable method. This method wraps a value in an Optional, which is empty if the value is null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
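As an illustration of those other methods, the following sketch chains map and filter before orElse so that blank names are treated the same way as null names. The NameDefaults class and nameOrDefault method are hypothetical, and treating blank strings as missing is an assumption that may not suit every dataset.

```java
import java.util.Optional;

class NameDefaults {
    // Returns the trimmed name, or DEFAULT when the name is null or blank
    static String nameOrDefault(String name) {
        return Optional.ofNullable(name)
            .map(String::trim)            // skipped when the Optional is empty
            .filter(s -> !s.isEmpty())    // a blank name becomes an empty Optional
            .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", " Bob ", "   ", null};
        for (String name : nameList) {
            System.out.println("Name to use = " + nameOrDefault(name));
        }
    }
}
```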

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet variable to hold the subset retrieved from the list. Next, we print out our original list, followed by its last element:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new 
ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we retrieve all numbers from the lowest number in our set, 12, up to but not including 46, and print the size of the subset:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());       

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4
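TreeSet also implements NavigableSet, which offers a few related views. The following sketch, using a hypothetical SubsetViews class, shows headSet, tailSet, and the four-argument form of subSet, where each bound can be marked as inclusive or exclusive.

```java
import java.util.Arrays;
import java.util.TreeSet;

class SubsetViews {
    // Same numbers as in the example above
    static TreeSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
    }

    public static void main(String[] args) {
        TreeSet<Integer> nums = sample();
        // Everything strictly below 46
        System.out.println("headSet: " + nums.headSet(46));
        // Everything from 46 upward, including 46
        System.out.println("tailSet: " + nums.tailSet(46));
        // From 12 to 46 with both ends included
        System.out.println("subSet: " + nums.subSet(12, true, 46, true));
    }
}
```

These methods return views backed by the original set, so changes to the underlying TreeSet are reflected in them.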

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use a list called numsList that holds the same numbers as in the previous example, but this time we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(Collectors.toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
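Header lines can be dropped in the same pipeline with the skip method. The following sketch assumes the number of header lines is known in advance; the HeaderSkipper class, the dataLines method, and the sample rows are hypothetical. It accepts any Reader so that it can be tried without a file.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class HeaderSkipper {
    // Skips the first headerLines lines, then drops blank lines.
    // The caller remains responsible for closing the reader.
    static List<String> dataLines(Reader source, int headerLines) {
        BufferedReader br = new BufferedReader(source);
        return br.lines()
            .skip(headerLines)
            .filter(s -> !s.trim().isEmpty())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Reader text = new StringReader("name,age\n\nZoey,8\n   \nRoxie,10\n");
        dataLines(text, 1).forEach(System.out::println);
    }
}
```

Calling trim before isEmpty also removes lines that contain only spaces, which the equals("") test would keep.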

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now pass the comparator to the sort method of the Collections class:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List instances and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression is used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each dog:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

      dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

   dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
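
The Dogs class itself is not shown in the text. The following is a minimal sketch of what it might look like, together with the age-then-name sort from above. The field layout, the sampleDogs helper, and the namesByAgeThenName method are assumptions made for illustration only:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DogSortSketch {

    // Hypothetical stand-in for the Dogs class assumed in the text.
    static class Dogs {
        private final String name;
        private final int age;

        Dogs(String name, int age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        int getAge() { return age; }
    }

    // Builds the same six dogs used in the chapter's example.
    static List<Dogs> sampleDogs() {
        List<Dogs> dogs = new ArrayList<>();
        dogs.add(new Dogs("Zoey", 8));
        dogs.add(new Dogs("Roxie", 10));
        dogs.add(new Dogs("Kylie", 7));
        dogs.add(new Dogs("Shorty", 14));
        dogs.add(new Dogs("Ginger", 7));
        dogs.add(new Dogs("Penny", 7));
        return dogs;
    }

    // Sorts a copy by age, breaking ties by name, and returns the names.
    static List<String> namesByAgeThenName(List<Dogs> dogs) {
        List<Dogs> copy = new ArrayList<>(dogs);
        copy.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName));
        List<String> names = new ArrayList<>();
        for (Dogs d : copy) {
            names.add(d.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesByAgeThenName(sampleDogs()));
    }
}
```

Sorting a copy leaves the caller's list untouched, which can be convenient when the original order still matters.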

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
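
The same try-catch approach can be adapted to floating point data. The following is a minimal sketch, assuming a hypothetical isValidDouble helper that returns a boolean rather than printing:

```java
public class DoubleValidationSketch {

    // Hypothetical helper: true when the text parses as a double.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false; // parseDouble would throw NullPointerException
        }
        try {
            Double.parseDouble(toValidate);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("12.5 valid? " + isValidDouble("12.5"));
        System.out.println("Ishmael valid? " + isValidDouble("Ishmael"));
    }
}
```

Keep in mind that Double.parseDouble is fairly permissive: it ignores surrounding whitespace and accepts forms such as 1e3 and NaN, which may or may not be desirable for a given dataset.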

The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method, this time printing the String it returns:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date, for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method reports that the date is not valid. If, however, the String can be parsed to a date, we format the parsed date back into a String and compare it with the original text to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
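
One way to loosen the restriction without giving up validation is to accept a short list of formats. The following is a minimal sketch that reuses the parse-and-reformat check from validateDate; the matchesAnyFormat name and the choice of formats are assumptions for illustration:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class FlexibleDateSketch {

    // Hypothetical helper: true when the text round-trips through
    // at least one of the supplied date formats.
    static boolean matchesAnyFormat(String theDate, String... dateFormats) {
        for (String dateFormat : dateFormats) {
            try {
                SimpleDateFormat format = new SimpleDateFormat(dateFormat);
                Date parsed = format.parse(theDate);
                if (format.format(parsed).equals(theDate)) {
                    return true;
                }
            } catch (ParseException e) {
                // Not this format; fall through and try the next one.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] formats = {"MM/dd/yyyy", "MM/dd/yy"};
        System.out.println(matchesAnyFormat("12/12/1982", formats));
        System.out.println(matchesAnyFormat("12/12/82", formats));
        System.out.println(matchesAnyFormat("Ishmael", formats));
    }
}
```

Every additional format widens what counts as valid, so the list is best kept to formats actually expected in the data.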

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address

The JavaMail API, which lives in the javax.mail package and is distributed separately from the core JDK, can also validate e-mail addresses. In this example, we use its InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s.',- in the character class to allow spaces, periods, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s.',-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Simple text cleaning

We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. In the second instance, we loop, replacing each pair of adjacent spaces with a single space until no double spaces remain:

out.println(dirtyText); 
dirtyText =    dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " "); 
dirtyText = dirtyText.trim(); 
while(dirtyText.contains("  ")){ 
  dirtyText = dirtyText.replaceAll("  ", " "); 
} 
out.println(dirtyText);  

When executed, the code produces the following output, truncated:

Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely

Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W, which represents anything that is not a word character:

out.println(dirtyText); 
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); 
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); 
for(String clean : cleanText){ 
  out.print(clean + " ");
} 

This code produces the same output as shown previously.

Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:

out.println(dirtyText); 
String[] words =    dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = String.join(" ", words); 
out.println(cleanText); 

Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:

out.println(dirtyText);  
String[] words =    dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = Joiner.on(" ").skipNulls().join(words); 
out.println(cleanText); 

This version provides additional options, including skipping nulls, as shown before. The output remains the same.
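
The cleaning steps in this section can be wrapped in a small reusable method. The following is a minimal sketch using only the standard library; the clean method name is hypothetical, and the extra trim on the result is an assumption added to cover text that begins with punctuation:

```java
public class TextCleaningSketch {

    // Hypothetical helper: lowercases the text, splits on runs of
    // non-word characters and digits, and rejoins with single spaces.
    static String clean(String dirtyText) {
        String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+");
        // A leading delimiter yields an empty first token, so trim again.
        return String.join(" ", words).trim();
    }

    public static void main(String[] args) {
        String dirtyText =
            "Call me Ishmael. Some years ago- never mind how long precisely -";
        System.out.println(clean(dirtyText));
    }
}
```

Note that \\W treats only ASCII letters, digits, and the underscore as word characters by default, so accented letters would be split out unless the pattern is adjusted.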

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform, split it into words, and store the words in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" "))); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList<>(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase().trim(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
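
When the stop word list is small, it can also be held in memory instead of a file. The following is a minimal sketch of the same removal step using a Set; the removeStopWords name and the stop words in main are illustrative assumptions, not the list used above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSetSketch {

    // Hypothetical helper: splits already-cleaned text on spaces and
    // drops every word found in the stop word set.
    static List<String> removeStopWords(String cleanText, Set<String> stopWords) {
        List<String> words = new ArrayList<>(Arrays.asList(cleanText.split(" ")));
        words.removeAll(stopWords);
        return words;
    }

    public static void main(String[] args) {
        // Illustrative stop words only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("me", "some"));
        System.out.println(
            removeStopWords("call me ishmael some years ago", stopWords));
    }
}
```

As with the file-based version, the text and the stop words need to share the same casing for the comparison to work.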

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin within the text, and the last parameter specifies how many characters to tokenize:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences from our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.
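
The counting logic can be packaged as a method that returns the count. The following is a minimal sketch; the countWord name is hypothetical, and splitting on non-word characters rather than a single space is an assumption made so that words followed by punctuation are still counted:

```java
public class WordCountSketch {

    // Hypothetical helper: counts whole-word, case-insensitive matches.
    static int countWord(String text, String toFind) {
        String needle = toFind.toLowerCase().trim();
        int count = 0;
        // Split on runs of non-word characters so "me." still yields "me".
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.equals(needle)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago- never mind how long "
            + "precisely - having little or no money in my purse, and nothing "
            + "particular to interest me on shore, I thought I would sail about "
            + "a little and see the watery part of the world.";
        System.out.println("Found i " + countWord(text, "I") + " times in the text.");
    }
}
```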

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to search the text up to a given horizon, which is a bound on how far ahead to look. If zero is used for the second parameter, as shown next, the horizon is ignored and the entire input is searched:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 0)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process the file line by line until readLine returns null, which signals the end of the file:

String path = "C://MobyDick.txt"; 
try { 
    String textLine = ""; 
    int line = 0; // number of lines read so far 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      // \\b marks a word boundary, so only whole-word matches are replaced 
      text = text.replaceAll("\\b" + toFind + "\\b", replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those contained within other words, such as some. This can be avoided by adding spaces around me, although doing so will miss any instance that is followed by punctuation, such as me at the end of a sentence. We will examine a better alternative using Google Guava in a moment.
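
With the standard library alone, a regular expression word boundary is one way to restrict the replacement to whole words. The following is a minimal sketch; the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeWordReplaceSketch {

    // Hypothetical helper: replaces toFind only where it stands alone
    // as a word, leaving occurrences inside longer words untouched.
    static String replaceWholeWord(String text, String toFind, String replaceWith) {
        // Pattern.quote and quoteReplacement keep special characters literal.
        String regex = "\\b" + Pattern.quote(toFind) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replaceWith));
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago, it was me.";
        System.out.println(replaceWholeWord(text, "me", "X"));
    }
}
```

Because \\b matches between a word character and a non-word character, me is still replaced when it is followed by a period, while the me inside Some is left alone.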

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by a whitespace character, such as the period after Ishmael or the hyphen after ago, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call Ishmael. So years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data, and we cannot illustrate every example in this chapter. As one example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date could be assigned the average of the other temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, a single substituted zero lowers the average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
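
The mean-substitution approach just described might be sketched as follows. The imputeWithMean name is hypothetical, and treating a caller-supplied marker value (zero here) as missing is an assumption that, as noted above, will not suit every dataset:

```java
public class MeanImputationSketch {

    // Hypothetical helper: returns a copy of values in which every entry
    // equal to missingMarker is replaced by the mean of the other entries.
    static double[] imputeWithMean(double[] values, double missingMarker) {
        double sum = 0;
        int count = 0;
        for (double v : values) {
            if (v != missingMarker) {
                sum += v;
                count++;
            }
        }
        double[] result = values.clone();
        if (count == 0) {
            return result; // nothing to average, so leave the data as-is
        }
        double mean = sum / count;
        for (int i = 0; i < result.length; i++) {
            if (result[i] == missingMarker) {
                result[i] = mean;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same monthly data as above, with the first month missing (zero).
        double[] tempList = {0, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList, 0);
        System.out.printf("Imputed first month: %.2f%n", filled[0]);
    }
}
```

For the temperature data used earlier, the eleven remaining months sum to 794, so the missing first month is filled with roughly 72.18 and the overall average becomes the same value.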

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also declared an Optional<String> variable called tempName, which we use to handle array values that may be null. We then loop through our array and call the Optional class's ofNullable method, which wraps the value in an Optional or returns an empty Optional if the value is null. On the next line, we call the orElse method to either assign the value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
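
As an illustration of those other methods, the following minimal sketch chains map and filter before orElse so that blank strings are treated like nulls. The displayName helper and the decision to uppercase the result are assumptions for demonstration only:

```java
import java.util.Optional;

public class OptionalSketch {

    // Hypothetical helper: trims the name, treats null or blank input
    // as missing, and otherwise returns the name in uppercase.
    static String displayName(String name) {
        return Optional.ofNullable(name)
                .map(String::trim)
                .filter(n -> !n.isEmpty())
                .map(String::toUpperCase)
                .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", "  ", null, "Betsy"};
        for (String name : nameList) {
            System.out.println("Name to use = " + displayName(name));
        }
    }
}
```

Each step in the chain is skipped once the Optional is empty, so no explicit null checks are required.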

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet variable to hold the subset retrieved from the list. Next, we print out our original list along with its largest element:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new 
ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is excluded. In our example that follows, we want to retrieve all numbers from the lowest number in the set, 12, up to but not including 46, and print the subset along with its size:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());       

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4
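
When the upper bound should be included, TreeSet also offers a four-argument subSet method from the NavigableSet interface, along with headSet and tailSet. The following is a minimal sketch; the closedRange and sample helper names are hypothetical:

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

public class InclusiveSubsetSketch {

    // Same numbers as in the example above.
    static TreeSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
    }

    // Hypothetical helper: returns the elements from low to high,
    // including both endpoints.
    static NavigableSet<Integer> closedRange(TreeSet<Integer> full, int low, int high) {
        return full.subSet(low, true, high, true);
    }

    public static void main(String[] args) {
        TreeSet<Integer> full = sample();
        System.out.println("Closed range: " + closedRange(full, 12, 46));
        // tailSet returns every element greater than or equal to its argument.
        SortedSet<Integer> tail = full.tailSet(52);
        System.out.println("Tail set: " + tail);
    }
}
```

The sets returned by these methods are views backed by the original TreeSet, so changes to one are reflected in the other.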

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numsList as in the previous example, but this time we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
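
The same stream can also drop header lines. The sketch below is an illustration under assumptions rather than the original example: it reads from an in-memory StringReader instead of a file so that it is self-contained, it assumes the number of header lines is known in advance, and the names HeaderSkipSketch and dataLines are hypothetical. It also treats whitespace-only lines as blank:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class HeaderSkipSketch {
    // Drops the first 'headerLines' lines, then drops lines that are empty
    // or contain only whitespace.
    static List<String> dataLines(BufferedReader reader, int headerLines) {
        return reader.lines()
                .skip(headerLines)
                .filter(s -> !s.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample data: one header line, then records mixed with blanks
        String sample = "name,age\n\nZoey,8\n   \nRoxie,10\n";
        try (BufferedReader br = new BufferedReader(new StringReader(sample))) {
            System.out.println(dataLines(br, 1));  // [Zoey,8, Roxie,10]
        }
    }
}
```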
Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the Integer.compare method, which returns a negative number, zero, or a positive number depending on whether the first integer is less than, equal to, or greater than the second:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now call the sort method as we did previously:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
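
The compareTo method orders strings by character value, so uppercase letters sort before lowercase ones. If the data has not been normalized to a single case, a case-insensitive comparator may be preferable. The following sketch uses the standard String.CASE_INSENSITIVE_ORDER comparator; the mixed-case word list and the names CaseSortSketch and sortIgnoringCase are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CaseSortSketch {
    // Returns a sorted copy, leaving the caller's list untouched.
    static List<String> sortIgnoringCase(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(String.CASE_INSENSITIVE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<String> mixed = Arrays.asList("cat", "Dog", "house", "Boat", "road", "Zoo");

        List<String> natural = new ArrayList<>(mixed);
        natural.sort((first, second) -> first.compareTo(second));
        System.out.println(natural);                  // [Boat, Dog, Zoo, cat, house, road]

        System.out.println(sortIgnoringCase(mixed));  // [Boat, cat, Dog, house, road, Zoo]
    }
}
```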

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both lists and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression is expected:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dogs object:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
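
The examples above assume a Dogs class without showing it. The sketch below supplies a hypothetical stand-in with only the two properties and accessors the text describes, and combines reversed with thenComparing to list the oldest dogs first while breaking ties by name. The names DogSortSketch and oldestFirst are introduced here for illustration only:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DogSortSketch {
    // Hypothetical stand-in for the Dogs class assumed by the text
    static class Dogs {
        private final String name;
        private final int age;

        Dogs(String name, int age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        int getAge() { return age; }
    }

    // Sorts a copy by age descending, then by name ascending,
    // and returns "name age" strings in that order.
    static List<String> oldestFirst(List<Dogs> dogs) {
        Comparator<Dogs> byAge = Comparator.comparingInt(Dogs::getAge);
        Comparator<Dogs> order = byAge.reversed().thenComparing(Dogs::getName);
        List<Dogs> copy = new ArrayList<>(dogs);
        copy.sort(order);
        List<String> result = new ArrayList<>();
        for (Dogs d : copy) {
            result.add(d.getName() + " " + d.getAge());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Dogs> dogs = new ArrayList<>();
        dogs.add(new Dogs("Zoey", 8));
        dogs.add(new Dogs("Roxie", 10));
        dogs.add(new Dogs("Kylie", 7));
        dogs.add(new Dogs("Shorty", 14));
        dogs.add(new Dogs("Ginger", 7));
        dogs.add(new Dogs("Penny", 7));
        // Prints Shorty 14, Roxie 10, Zoey 8, Ginger 7, Kylie 7, Penny 7
        oldestFirst(dogs).forEach(System.out::println);
    }
}
```

Calling reversed before thenComparing reverses only the age comparison; calling it at the end of the chain would reverse the name ordering as well.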
Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            // parseInt rejects any text that is not a base-10 int 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
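
As noted above, the same pattern adapts to other numeric types. The following sketch applies it to double values. It returns a boolean instead of printing, which is a design choice made here so the result can be reused by a caller; the names NumberValidationSketch and isValidDouble are hypothetical:

```java
class NumberValidationSketch {
    // Returns true when the text can be parsed by Double.parseDouble.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false;  // parseDouble would throw NullPointerException
        }
        try {
            Double.parseDouble(toValidate);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDouble("12.5"));     // true
        System.out.println(isValidDouble("1e3"));      // true
        System.out.println(isValidDouble("Ishmael"));  // false
        System.out.println(isValidDouble(""));         // false
    }
}
```

Be aware that Double.parseDouble also accepts scientific notation and the special strings NaN and Infinity, so additional checks may be needed if those should be rejected.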

The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
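
One way to be less restrictive without giving up strictness is to try several acceptable formats in turn. The sketch below is an alternative to the SimpleDateFormat approach, not the method used above: it relies on the java.time API introduced in Java 8, and the names DateValidationSketch and parseWithAny, as well as the chosen list of patterns, are assumptions for illustration. ResolverStyle.STRICT rejects impossible dates such as February 30:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class DateValidationSketch {
    // Tries each pattern in order and returns the first successful parse.
    static Optional<LocalDate> parseWithAny(String text, List<String> patterns) {
        for (String pattern : patterns) {
            DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern)
                    .withResolverStyle(ResolverStyle.STRICT);
            try {
                return Optional.of(LocalDate.parse(text, formatter));
            } catch (DateTimeParseException e) {
                // Not this format; fall through to the next pattern
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // 'u' (year) is used instead of 'y' because strict resolving
        // of 'y' (year-of-era) also requires an era field
        List<String> accepted = Arrays.asList("MM/dd/uuuu", "MM/dd/uu");
        System.out.println(parseWithAny("12/12/1982", accepted));  // Optional[1982-12-12]
        System.out.println(parseWithAny("12/12/82", accepted));    // Optional[2082-12-12]
        System.out.println(parseWithAny("02/30/1982", accepted));  // Optional.empty
        System.out.println(parseWithAny("Ishmael", accepted));     // Optional.empty
    }
}
```

The second result shows why two-digit years need care: the uu pattern resolves 82 against a base of 2000, giving 2082 rather than 1982. Accepting more formats trades one risk (discarding useful data) for another (misreading it).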

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address

The JavaMail API (the javax.mail package, which is not part of the core Java SE library) can validate e-mail addresses as well. In this example, we use its InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
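
To accommodate more than one country, the patterns can be kept in a lookup table keyed by country code. The sketch below is an assumption-laden illustration: the Canadian pattern is a simplified letter-digit-letter, digit-letter-digit form that does not enforce the real restrictions on which letters may appear, and the names PostalCodeSketch and isValid are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PostalCodeSketch {
    private static final Map<String, Pattern> PATTERNS = new HashMap<>();

    static {
        // Same United States pattern as in validateZip
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        // Simplified Canadian pattern (an assumption, not a complete rule)
        PATTERNS.put("CA", Pattern.compile("^[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]$"));
    }

    // Returns false for unknown countries rather than guessing a format.
    static boolean isValid(String country, String code) {
        Pattern pattern = PATTERNS.get(country);
        return pattern != null && code != null
                && pattern.matcher(code.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("US", "12345-6789"));  // true
        System.out.println(isValid("US", "123"));         // false
        System.out.println(isValid("CA", "K1A 0B1"));     // true
        System.out.println(isValid("FR", "75001"));       // false (no pattern registered)
    }
}
```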

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s, a period, an apostrophe, a comma, and a hyphen in the character class to allow spaces, periods, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s.',-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform, split it into words, and store it in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" "))); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList<>(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase().trim(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
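
When the stop word list is large, holding it in a HashSet and filtering the text in a single pass avoids scanning the word list once per stop word. The following is a minimal sketch under assumptions: the stop words are supplied in memory rather than read from a file, the short stop word list is invented for the demonstration, and the names StopWordSketch and removeStopWords are hypothetical:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordSketch {
    // Lowercases and splits the text on whitespace, then keeps only
    // the words that are not in the stop word set.
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        return Arrays.stream(text.toLowerCase().trim().split("\\s+"))
                .filter(word -> !word.isEmpty())
                .filter(word -> !stopWords.contains(word))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed stop words; a real list would normally come from a file
        Set<String> stopWords = new HashSet<>(
                Arrays.asList("me", "some", "the", "and", "or", "a", "of", "in"));
        System.out.println(
                removeStopWords("Call me Ishmael some years ago", stopWords));
        // [call, ishmael, years, ago]
    }
}
```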

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin searching within the text, and the last parameter specifies where to end:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences from our previous example. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 0)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
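
When the positions and the number of matches do matter, the Matcher class can report both. The sketch below is one possible approach rather than the technique shown above; it matches whole words only by wrapping the search term in \b word boundaries, and the names WordPositionSketch and positionsOf are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPositionSketch {
    // Returns the start index of every whole-word, case-insensitive match.
    static List<Integer> positionsOf(String text, String word) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        List<Integer> positions = new ArrayList<>();
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "on shore, I thought I would sail";
        List<Integer> found = positionsOf(text, "i");
        // The i inside "sail" is not counted because it is not a whole word
        System.out.println("Found " + found.size() + " times at " + found);
        // Found 2 times at [10, 20]
    }
}
```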

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file for as long as readLine returns a line rather than null, which signals the end of the file:

String path = "C:// MobyDick.txt"; 
try { 
    String textLine = ""; 
    int line = 0; // counts the lines read so far 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string. Because replaceAll treats its first argument as a regular expression, we wrap toFind in \\b word boundaries so that only whole words are replaced, and not the same letters inside other words:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      text = text.replaceAll("\\b" + Pattern.quote(toFind) + "\\b", replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me , although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.
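
Using only the standard library, a regular expression with \b word boundaries is another way to avoid both problems: it leaves words such as some untouched and still matches me before a period. The following is a minimal sketch; the names WholeWordReplaceSketch and replaceWord and the sample sentence are assumptions for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordReplaceSketch {
    // Replaces only whole-word occurrences of 'word'.
    static String replaceWord(String text, String word, String replacement) {
        // quote() and quoteReplacement() keep special characters literal
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago, trust me.";
        System.out.println(replaceWord(text, "me", "X"));
        // Call X Ishmael. Some years ago, trust X.
    }
}
```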

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by whitespace, such as the period after Ishmael or the hyphen after ago, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call   Ishmael. So  years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...
Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.

Among statistical analysts, there are differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that a single substituted zero lowers the computed average by more than four degrees, from 70.33 to 66.17. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
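
The mean-substitution approach just described can be sketched as follows. This is an illustration under assumptions: missing readings are represented by Double.NaN rather than zero so that a genuine zero is never mistaken for missing data, and the names MeanImputationSketch and imputeWithMean are hypothetical:

```java
import java.util.Arrays;
import java.util.Locale;

class MeanImputationSketch {
    // Replaces every NaN with the mean of the values that are present.
    static double[] imputeWithMean(double[] values) {
        double mean = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .average()
                .orElse(Double.NaN);  // stays NaN if every value is missing
        return Arrays.stream(values)
                .map(v -> Double.isNaN(v) ? mean : v)
                .toArray();
    }

    public static void main(String[] args) {
        // Same monthly temperatures as above, with the first one missing
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList);
        double average = Arrays.stream(filled).average().orElse(Double.NaN);
        System.out.println(String.format(Locale.US,
                "Imputed value %.2f, average temperature %.2f", filled[0], average));
        // Imputed value 72.18, average temperature 72.18
    }
}
```

Substituting the mean leaves the average of the dataset unchanged, which is the appeal of the approach, although it does understate the variation in the data.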

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also declared an Optional variable called tempName. We will use this to handle values in the array that may be null. We then loop through our array and call the Optional class ofNullable method, which wraps each element in an Optional that is empty when the element is null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
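
As a brief illustration of those other methods, the following sketch chains map and filter before orElse so that blank names are treated the same way as null ones. The normalization rules (trim, reject empty, convert to uppercase) and the names OptionalSketch and normalize are assumptions chosen for the demonstration:

```java
import java.util.Optional;

class OptionalSketch {
    // Trims the name, treats null or blank input as missing,
    // and falls back to DEFAULT when nothing usable remains.
    static String normalize(String name) {
        return Optional.ofNullable(name)
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(String::toUpperCase)
                .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", "  bob ", "", null};
        for (String name : nameList) {
            System.out.println("Name to use = " + normalize(name));
        }
        // Name to use = AMY
        // Name to use = BOB
        // Name to use = DEFAULT
        // Name to use = DEFAULT
    }
}
```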

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet variable to hold the subset retrieved from the set. Next, we print out our original set followed by its largest element:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new 
ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is inclusive while the second is exclusive. In the example that follows, we retrieve all numbers from the lowest number in our set, 12, up to but not including 46, and print the subset followed by its size:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());       

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4
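
When the upper bound should be part of the result, the NavigableSet interface implemented by TreeSet offers an overload of subSet with explicit inclusive flags, along with headSet and tailSet for open-ended ranges. The following is a small sketch under that assumption; the class name SubsetBounds and the helper sampleSet are hypothetical:

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

final class SubsetBounds {

    // Builds a sorted set holding the same numbers used in the example above.
    static NavigableSet<Integer> sampleSet() {
        return new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
    }

    public static void main(String[] args) {
        NavigableSet<Integer> full = sampleSet();
        // Both bounds inclusive, so 46 is part of the result.
        System.out.println("12 to 46 inclusive: " + full.subSet(12, true, 46, true));
        // Everything strictly below 46.
        System.out.println("Below 46: " + full.headSet(46));
        // Everything from 46 upward; the bound is inclusive for tailSet.
        System.out.println("46 and above: " + full.tailSet(46));
    }
}
```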

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use a list called numsList holding the same numbers as in the previous example, but this time we will specify how many elements to skip with the skip method. We will also use the collect method, together with the toCollection method of the Collectors class, to create a new Set to hold the remaining elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method of the Integer class, which determines how the two integers are ordered:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now pass compareInts to the sort method of the Collections class to sort numsList:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List objects and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression could be used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dogs object:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
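
The same chaining style can mix ascending and descending keys. The sketch below sorts the oldest dogs first and breaks ties by name. The nested Dogs class is a hypothetical stand-in for the class assumed in the text, and DogSortSketch, sampleDogs, and namesOldestFirst are illustrative names only:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class DogSortSketch {

    // Hypothetical stand-in for the Dogs class assumed in the text.
    static final class Dogs {
        private final String name;
        private final int age;
        Dogs(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
    }

    static List<Dogs> sampleDogs() {
        return Arrays.asList(new Dogs("Zoey", 8), new Dogs("Roxie", 10),
                new Dogs("Kylie", 7), new Dogs("Shorty", 14),
                new Dogs("Ginger", 7), new Dogs("Penny", 7));
    }

    // Oldest dogs first; dogs of the same age are ordered by name.
    static List<String> namesOldestFirst(List<Dogs> dogs) {
        List<Dogs> sorted = new ArrayList<>(dogs);   // leave the input list untouched
        sorted.sort(Comparator.comparingInt(Dogs::getAge).reversed()
                .thenComparing(Dogs::getName));
        List<String> result = new ArrayList<>();
        for (Dogs d : sorted) {
            result.add(d.getName() + " " + d.getAge());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : namesOldestFirst(sampleDogs())) {
            System.out.println(line);
        }
    }
}
```

Note that reversed applies to the comparator built so far, so calling it before thenComparing reverses only the age comparison and leaves the names in ascending order.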

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            // parseInt throws NumberFormatException if the text is not an integer 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
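
The same try-catch technique carries over to the other numeric types. As a minimal sketch for Double, the following variant returns a boolean instead of printing, which is an assumption made here so the result can be reused by calling code. The names NumberValidation and isValidDouble are hypothetical:

```java
final class NumberValidation {

    // Returns true when the text can be parsed as a double.
    // Null and malformed text are both rejected.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false;
        }
        try {
            Double.parseDouble(toValidate.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"12.5", "1234", "Ishmael"};
        for (String sample : samples) {
            System.out.println(sample + (isValidDouble(sample)
                    ? " is a valid double" : " is not a valid double"));
        }
    }
}
```

Keep in mind that parseDouble also accepts forms such as scientific notation (1e3) and NaN, which may or may not be desirable for a given dataset.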

The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal. This version returns its message as a String rather than printing it:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method. Because this version returns a String, we print the result of each call:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we format the resulting Date object back into a String and compare it with the original text. The date is considered valid only if the two are identical:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
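
One way to loosen the restriction without giving up strictness is to accept a short list of formats. The following sketch uses the java.time API introduced in Java 8 instead of SimpleDateFormat, which is a substitution made for this illustration. The names DateValidation and isValidDate are hypothetical:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

final class DateValidation {

    // Returns true if theDate matches at least one of the given patterns.
    // STRICT resolution rejects impossible dates such as February 30.
    // The patterns use u (year) rather than y (year-of-era) because
    // strict resolution of y would also require an era field.
    static boolean isValidDate(String theDate, String... patterns) {
        for (String pattern : patterns) {
            DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern)
                    .withResolverStyle(ResolverStyle.STRICT);
            try {
                LocalDate.parse(theDate, formatter);
                return true;
            } catch (DateTimeParseException e) {
                // This pattern did not match; try the next one.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {"12/12/1982", "12/12/82", "02/30/1982", "Ishmael"};
        for (String sample : samples) {
            boolean valid = isValidDate(sample, "MM/dd/uuuu", "MM/dd/uu");
            System.out.println(sample + (valid ? " is a valid date" : " is not a valid date"));
        }
    }
}
```

Passing only MM/dd/uuuu reproduces the strict behavior of the earlier example, while adding MM/dd/uu also admits two-digit years.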

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address

The JavaMail API, which is distributed separately from the core Java libraries, can also be used to validate e-mail addresses. In this example, we use its InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
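
To accommodate more than one country, the patterns can be kept in a lookup table keyed by country code. The sketch below assumes a deliberately simplified Canadian pattern for illustration only; real postal rules are more detailed. The class name PostalCodes and its method isValid are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class PostalCodes {

    private static final Map<String, Pattern> PATTERNS = new HashMap<>();

    static {
        // Same United States pattern as in the example above.
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        // Simplified Canadian shape (letter-digit-letter, optional space,
        // digit-letter-digit). This is an assumption for illustration,
        // not a complete rule set.
        PATTERNS.put("CA", Pattern.compile("^[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]$"));
    }

    // Returns false for unknown countries as well as for malformed codes.
    static boolean isValid(String countryCode, String postalCode) {
        Pattern pattern = PATTERNS.get(countryCode);
        return pattern != null
                && postalCode != null
                && pattern.matcher(postalCode.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println("US 12345-6789: " + isValid("US", "12345-6789"));
        System.out.println("CA K1A 0B1: " + isValid("CA", "K1A 0B1"));
        System.out.println("US 123: " + isValid("US", "123"));
    }
}
```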

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s, a period, a comma, an apostrophe, and a hyphen in the character class to allow whitespace and these punctuation marks in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s.,'-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
      out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.
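
Splitting on a single space misses words that are followed by punctuation, such as a word at the end of a sentence. As an alternative sketch, a regular expression with word-boundary anchors counts whole-word matches regardless of the surrounding punctuation. The names WordCounter and countWord are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordCounter {

    // Counts whole-word, case-insensitive occurrences of toFind in text.
    static int countWord(String text, String toFind) {
        // \b matches a word boundary; Pattern.quote treats toFind literally.
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(toFind.trim()) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "I thought I would sail; yes, I.";
        System.out.println("Found i " + countWord(text, "i") + " times in the text.");
    }
}
```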

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 0)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
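
Counting is still possible with a little extra work, because each successful call to findWithinHorizon advances the Scanner past the match it returns. The following sketch relies on that behavior and assumes toFind is a non-empty literal string. The names HorizonSearch and countMatches are hypothetical:

```java
import java.util.Scanner;
import java.util.regex.Pattern;

final class HorizonSearch {

    // Counts occurrences of the literal toFind anywhere in text,
    // including occurrences inside longer words.
    static int countMatches(String text, String toFind) {
        Pattern pattern = Pattern.compile(Pattern.quote(toFind));
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            // A horizon of 0 places no limit on how far ahead to search.
            // The method returns null once no further match exists.
            while (scanner.findWithinHorizon(pattern, 0) != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "call me ishmael. some years ago";
        System.out.println("Found me " + countMatches(text, "me") + " times");
    }
}
```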

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file line by line until the readLine method returns null, which signals the end of the file:

String path = "C:// MobyDick.txt"; 
try { 
    String textLine = ""; 
    int line = 0; // counts the lines read so far 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
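if(text.contains(toFind)){ 
      // The word-boundary anchors restrict the match to whole words, so the 
      // same letters inside other words (such as the i in mind) are untouched 
      text = text.replaceAll("\\b" + Pattern.quote(toFind) + "\\b", replaceWith); 
      out.println(text); 
} 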

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me, although this will miss any instance where me is followed by punctuation, such as at the end of a sentence. We will examine a better alternative using Google Guava in a moment.

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by whitespace, such as a period or hyphen before a space, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of extra spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
String spaced = CharMatcher.WHITESPACE.trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call   Ishmael. So  years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while replacing a single value may seem like a minor change, it lowers the calculated average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
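
The averaging approach just described can be sketched as follows. The sketch assumes that a single marker value (zero here) stands for missing data, which, as noted above, is only safe when that value cannot occur legitimately. The names MeanImputation and imputeWithMean are hypothetical:

```java
final class MeanImputation {

    // Returns a copy of values in which every entry equal to missingMarker
    // is replaced by the mean of the remaining entries.
    static double[] imputeWithMean(double[] values, double missingMarker) {
        double sum = 0;
        int present = 0;
        for (double v : values) {
            if (v != missingMarker) {
                sum += v;
                present++;
            }
        }
        double[] result = values.clone();
        if (present == 0) {
            return result; // nothing to average, so leave the data unchanged
        }
        double mean = sum / present;
        for (int i = 0; i < result.length; i++) {
            if (result[i] == missingMarker) {
                result[i] = mean;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] tempList = {0, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] imputed = imputeWithMean(tempList, 0);
        double sum = 0;
        for (double d : imputed) {
            sum += d;
        }
        System.out.printf("The average temperature is %.2f%n", sum / imputed.length);
    }
}
```

With the array from the example, the eleven remaining values sum to 794, so the substituted value is about 72.18. The resulting average is closer to the original 70.33 than the 66.17 obtained by substituting zero.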

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first create a variable called useName to hold the name we will actually print out. We also declare an Optional<String> variable called tempName, which we use to wrap each value in the array, whether or not it is null. We then loop through our array and call the ofNullable method of the Optional class. This method returns an Optional containing the value, or an empty Optional if the value is null. On the next line, we call the orElse method to assign either the value from the array to useName or, if the element is null, the string DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet variable to hold the subset retrieved from the set. Next, we print out our original set followed by its largest element:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new 
ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is inclusive while the second is exclusive. In the example that follows, we retrieve all numbers from the lowest number in our set, 12, up to but not including 46, and print the subset followed by its size:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());       

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4
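When the upper bound should be included as well, the NavigableSet interface, which TreeSet implements, offers a subSet overload with explicit inclusive flags, along with headSet and tailSet. The following is a minimal sketch; the class name RangeSubsetDemo and the helper closedRange are hypothetical names used only for illustration:

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class RangeSubsetDemo {

    // Returns the values in [low, high], with both bounds included.
    static NavigableSet<Integer> closedRange(NavigableSet<Integer> values, int low, int high) {
        return values.subSet(low, true, high, true);
    }

    public static void main(String[] args) {
        NavigableSet<Integer> nums =
                new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
        System.out.println(closedRange(nums, 12, 46)); // [12, 14, 34, 44, 46]
        System.out.println(nums.headSet(46));          // [12, 14, 34, 44]
        System.out.println(nums.tailSet(46));          // [46, 52, 87, 123]
    }
}
```

Keep in mind that subSet, headSet, and tailSet return views backed by the original set, so changes to one are reflected in the other.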

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, this time held in a List called numsList, and specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the remaining elements. For the code to compile as written, the toCollection method must be statically imported from the Collectors class:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]
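The skip method can also be paired with limit to pull a slice from the middle of the sorted data. The sketch below assumes a hypothetical helper called slice in a class called SliceDemo; neither name comes from the original example:

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

class SliceDemo {

    // Sorts the values, drops the first `offset` of them, and keeps at most `count`.
    static List<Integer> slice(List<Integer> values, int offset, int count) {
        return new TreeSet<>(values).stream()
                .skip(offset)
                .limit(count)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        System.out.println(slice(numsList, 2, 3)); // [34, 44, 46]
    }
}
```

Because the values pass through a TreeSet, duplicates are removed before the slice is taken, which may or may not be what a given analysis needs.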

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
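The same approach can be extended to drop header lines as well as blank ones. The following sketch reads from an in-memory string instead of a file so that it is self-contained; the class name HeaderFilterDemo, the helper dataLines, and the sample text are all assumptions made for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

class HeaderFilterDemo {

    // Drops the first `headerLines` lines, then removes lines that are
    // empty or contain only whitespace.
    static List<String> dataLines(String rawText, int headerLines) {
        try (BufferedReader br = new BufferedReader(new StringReader(rawText))) {
            return br.lines()
                    .skip(headerLines)
                    .filter(line -> !line.trim().isEmpty())
                    .collect(Collectors.toList());
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        String rawText = "name,age\n\nZoey,8\n   \nRoxie,10\n";
        System.out.println(dataLines(rawText, 1)); // [Zoey,8, Roxie,10]
    }
}
```

Trimming before the emptiness test means that lines made up only of spaces or tabs are discarded too, which the equals("") check in the original example would let through.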

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method of the Integer class, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now pass compareInts to the sort method of the Collections class:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List instances and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression can be used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
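The Comparator interface and the String class also supply ready-made comparators, so a lambda or method reference is not always needed. The sketch below shows Comparator.reverseOrder, Comparator.naturalOrder, and String.CASE_INSENSITIVE_ORDER. The class name ComparatorDemo, the helper sortedCopy, and the mixed-case word list are assumptions introduced for this example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ComparatorDemo {

    // Returns a sorted copy so the caller's list is left untouched.
    static <T> List<T> sortedCopy(List<T> values, Comparator<? super T> order) {
        List<T> copy = new ArrayList<>(values);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        List<String> wordsList = Arrays.asList("cat", "Dog", "house", "Boat");

        System.out.println(sortedCopy(numsList, Comparator.reverseOrder()));
        // [123, 87, 52, 46, 44, 34, 14, 12]
        System.out.println(sortedCopy(wordsList, Comparator.naturalOrder()));
        // [Boat, Dog, cat, house]
        System.out.println(sortedCopy(wordsList, String.CASE_INSENSITIVE_ORDER));
        // [Boat, cat, Dog, house]
    }
}
```

The second and third results differ because the natural ordering of String places uppercase letters before lowercase ones, which matters when cleaning text with inconsistent capitalization.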

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each dog:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

   dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
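The chained comparators can also mix ascending and descending keys. The sketch below sorts the oldest dogs first and breaks ties by name. Since the text only assumes that a Dogs class exists, the nested Dogs class here is a hypothetical stand-in, as are the names DogSortDemo and oldestFirst:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DogSortDemo {

    // Hypothetical stand-in for the Dogs class assumed by the text.
    static class Dogs {
        private final String name;
        private final int age;

        Dogs(String name, int age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        int getAge() { return age; }
    }

    // Oldest dogs first; dogs of the same age are ordered by name.
    static List<String> oldestFirst(List<Dogs> dogs) {
        List<Dogs> copy = new ArrayList<>(dogs);
        copy.sort(Comparator.comparingInt(Dogs::getAge).reversed()
                .thenComparing(Dogs::getName));
        List<String> rows = new ArrayList<>();
        for (Dogs d : copy) {
            rows.add(d.getName() + " " + d.getAge());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Dogs> dogs = new ArrayList<>();
        dogs.add(new Dogs("Zoey", 8));
        dogs.add(new Dogs("Roxie", 10));
        dogs.add(new Dogs("Kylie", 7));
        dogs.add(new Dogs("Shorty", 14));
        dogs.add(new Dogs("Ginger", 7));
        dogs.add(new Dogs("Penny", 7));
        System.out.println(oldestFirst(dogs));
        // [Shorty 14, Roxie 10, Zoey 8, Ginger 7, Kylie 7, Penny 7]
    }
}
```

Note that reversed applies to everything chained before it, so it is called on the age comparator before thenComparing adds the name comparison.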

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            // parseInt throws this exception when the text is not a valid int
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
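As a sketch of how the same technique might be adapted for floating point data, the following variant uses Double.parseDouble and returns a boolean instead of printing. The names NumberValidationDemo and isValidDouble are hypothetical, and the decision to reject NaN and infinite values is an assumption about what counts as valid data:

```java
class NumberValidationDemo {

    // Returns true when the text parses as a finite double.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false;
        }
        try {
            double value = Double.parseDouble(toValidate.trim());
            // parseDouble accepts "NaN" and "Infinity", so reject them explicitly
            return !Double.isNaN(value) && !Double.isInfinite(value);
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDouble("12.5"));    // true
        System.out.println(isValidDouble("1e3"));     // true
        System.out.println(isValidDouble("NaN"));     // false
        System.out.println(isValidDouble("Ishmael")); // false
    }
}
```

Returning a boolean rather than printing makes the check easier to reuse inside a filter or a larger cleaning routine.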

The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method reports the date as invalid. Because SimpleDateFormat is lenient by default, a successful parse is not enough on its own. We therefore format the parsed Date back into a String and compare it with the original text to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
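One way to loosen the restriction without giving up strict checking is to accept a short list of formats. The sketch below uses the java.time API introduced in Java 8 instead of SimpleDateFormat, with ResolverStyle.STRICT so that impossible dates are rejected. The names DateValidationDemo and isValidDate are hypothetical, and the choice of accepted patterns is an assumption for illustration:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class DateValidationDemo {

    // Returns true if the text matches at least one accepted pattern
    // and names a real calendar date.
    static boolean isValidDate(String theDate, String... patterns) {
        for (String pattern : patterns) {
            DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern)
                    .withResolverStyle(ResolverStyle.STRICT);
            try {
                LocalDate.parse(theDate, formatter);
                return true;
            } catch (DateTimeParseException e) {
                // Try the next accepted pattern.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isValidDate("12/12/1982", "MM/dd/uuuu"));           // true
        System.out.println(isValidDate("12/12/82", "MM/dd/uuuu"));             // false
        System.out.println(isValidDate("12/12/82", "MM/dd/uuuu", "MM/dd/uu")); // true
        System.out.println(isValidDate("02/30/2020", "MM/dd/uuuu"));           // false
    }
}
```

The patterns use uuuu rather than yyyy because, under strict resolution, a year-of-era field also needs an era. Two-digit years remain ambiguous about the century, so accepting them is a trade-off between coverage and precision.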

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      // Keep a reference to the compiled pattern so we can create a Matcher
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address

The JavaMail API, which is distributed separately from the core Java SE library, also supports validating e-mail addresses. In this example, we use its InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted according to country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
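To suggest how this might extend to other countries, the sketch below keeps one pattern per country code in a map. The class name PostalCodeDemo and the method isValidPostalCode are hypothetical, and the Canadian pattern is a deliberate simplification that checks only the letter-digit layout, not which letters are actually in use:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PostalCodeDemo {

    // Illustrative patterns only; real postal rules have more exceptions.
    private static final Map<String, Pattern> PATTERNS = new HashMap<>();
    static {
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        PATTERNS.put("CA", Pattern.compile("^[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]$"));
    }

    // Returns false for unknown countries rather than guessing a format.
    static boolean isValidPostalCode(String countryCode, String postalCode) {
        Pattern pattern = PATTERNS.get(countryCode);
        return pattern != null && postalCode != null
                && pattern.matcher(postalCode.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPostalCode("US", "12345-6789")); // true
        System.out.println(isValidPostalCode("CA", "K1A 0B1"));    // true
        System.out.println(isValidPostalCode("CA", "12345"));      // false
        System.out.println(isValidPostalCode("FR", "75001"));      // false, no pattern registered
    }
}
```

Compiling each pattern once and reusing it avoids recompiling the regular expression for every record in a large dataset.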

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also use \\s.',- to allow spaces, periods, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s.',-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Finding and replacing text

We often want not only to find text but also to replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string. Note that replaceAll treats its first argument as a regular expression:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      text = text.replaceAll(toFind, replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me , although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.
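With the standard library alone, one possible way to avoid matching inside other words is a regular expression with word boundaries. The sketch below is an illustration under that assumption; the class name WholeWordReplaceDemo, the helper replaceWord, and the shortened sample sentence are all hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordReplaceDemo {

    // Replaces `word` only where it stands alone, ignoring case.
    static String replaceWord(String text, String word, String replacement) {
        // Pattern.quote keeps any special characters in `word` literal
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        // quoteReplacement keeps $ and \ in the replacement literal
        return pattern.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago, nothing would interest me.";
        System.out.println(replaceWord(text, "me", "X"));
        // Call X Ishmael. Some years ago, nothing would interest X.
    }
}
```

The \\b boundary matches between a word character and a non-word character, so me at the end of a sentence is replaced while the me inside Some is left alone. This works as intended only when the search term itself begins and ends with a word character.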

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by whitespace, such as the period at the end of a sentence or a hyphen before a space, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
// Newer Guava releases expose this matcher as whitespace() rather than the WHITESPACE constant
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call   Ishmael. So  years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analyzed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data, and we cannot illustrate every case in this chapter. As one example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date could be assigned the average of the other temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Although this change may seem minor, it lowers the reported average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
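The averaging approach just described can be sketched as follows. This example marks missing readings with Double.NaN rather than zero, which is an assumption made so that a genuine zero reading is not mistaken for missing data. The names MeanImputationDemo and imputeWithMean are hypothetical:

```java
import java.util.Arrays;

class MeanImputationDemo {

    // Returns a copy in which every NaN is replaced by the mean of the
    // values that are present. If nothing is present, the copy is unchanged.
    static double[] imputeWithMean(double[] values) {
        double sum = 0;
        int count = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        double[] result = Arrays.copyOf(values, values.length);
        if (count == 0) {
            return result;
        }
        double mean = sum / count;
        for (int i = 0; i < result.length; i++) {
            if (Double.isNaN(result[i])) {
                result[i] = mean;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList);
        System.out.printf("Imputed value: %.2f%n", filled[0]);
        System.out.printf("Average after imputation: %.2f%n",
                Arrays.stream(filled).average().orElse(Double.NaN));
    }
}
```

With the first month missing, the remaining eleven values sum to 794, so the imputed value is roughly 72.18, and the overall average after imputation is the same figure. Substituting the mean leaves the average of the present values unchanged but reduces the variance, which should be kept in mind for any later analysis.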

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first create a variable called useName to hold the name we will actually print. We also declare an Optional<String> variable called tempName, which we use to handle array values that may be null. We then loop through the array and call the static ofNullable method of the Optional class. This method wraps the value in an Optional, which is empty when the value is null. On the next line, we call the orElse method to assign either the value from the array to useName or, if the element is null, DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method of the Integer class, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now pass compareInts to the sort method of the Collections class:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List instances and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First, we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression can be used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
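
The relationship between the lambda expression, the reversed method, and a ready-made comparator can be sketched as follows. This is a minimal illustration rather than the only way to sort in descending order; the class name DescendingSortSketch and its helper methods are hypothetical names introduced here, while Comparator.reverseOrder is part of the standard library:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class DescendingSortSketch {

    // Sorts a copy of the input in descending order with a lambda and reversed().
    static List<Integer> withLambda(List<Integer> input) {
        List<Integer> copy = new ArrayList<>(input);
        Comparator<Integer> basicOrder = (first, second) -> Integer.compare(first, second);
        Collections.sort(copy, basicOrder.reversed());
        return copy;
    }

    // Produces the same order with the comparator returned by Comparator.reverseOrder().
    static List<Integer> withReverseOrder(List<Integer> input) {
        List<Integer> copy = new ArrayList<>(input);
        copy.sort(Comparator.reverseOrder());
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        System.out.println("Lambda and reversed: " + withLambda(numsList));
        System.out.println("reverseOrder:        " + withReverseOrder(numsList));
    }
}
```

Both helpers sort a copy, so the original list keeps its order and can be reused in later examples.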

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each dog:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
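
Because the Dogs class is only assumed in the text, a self-contained version may be useful for experimenting. The following is a minimal sketch: the nested Dogs class, the DogSortSketch wrapper, and the helper methods are hypothetical stand-ins written for this illustration, not the class used elsewhere in the book:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DogSortSketch {

    // Hypothetical stand-in for the assumed Dogs class: name, age, and accessors.
    static class Dogs {
        private final String name;
        private final int age;

        Dogs(String name, int age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        int getAge() { return age; }
    }

    static List<Dogs> sampleDogs() {
        List<Dogs> dogs = new ArrayList<>();
        dogs.add(new Dogs("Zoey", 8));
        dogs.add(new Dogs("Roxie", 10));
        dogs.add(new Dogs("Kylie", 7));
        dogs.add(new Dogs("Shorty", 14));
        dogs.add(new Dogs("Ginger", 7));
        dogs.add(new Dogs("Penny", 7));
        return dogs;
    }

    // Collects the names of the dogs in their current list order.
    private static List<String> names(List<Dogs> dogs) {
        List<String> result = new ArrayList<>();
        for (Dogs d : dogs) {
            result.add(d.getName());
        }
        return result;
    }

    // Name is the primary key; age only breaks ties between equal names.
    static List<String> namesByNameThenAge(List<Dogs> dogs) {
        List<Dogs> copy = new ArrayList<>(dogs);
        copy.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge));
        return names(copy);
    }

    // Age is the primary key; name only breaks ties between equal ages.
    static List<String> namesByAgeThenName(List<Dogs> dogs) {
        List<Dogs> copy = new ArrayList<>(dogs);
        copy.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName));
        return names(copy);
    }

    public static void main(String[] args) {
        System.out.println("By name, then age: " + namesByNameThenAge(sampleDogs()));
        System.out.println("By age, then name: " + namesByAgeThenName(sampleDogs()));
    }
}
```

Since every name in the sample data is unique, the second key never changes the result of the name-first sort. It only matters in the age-first sort, where three dogs share the age of 7.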

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            // parseInt throws NumberFormatException for non-integer text 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
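
The same try-catch technique can be adapted for floating point data. The sketch below is one possible adaptation; the class name NumericValidationSketch and the method isValidDouble are hypothetical names chosen for this illustration. Unlike Integer.parseInt, Double.parseDouble throws a NullPointerException for null input, so the sketch checks for null first:

```java
class NumericValidationSketch {

    // Returns true if the text can be parsed as a double, false otherwise.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            // Double.parseDouble(null) throws NullPointerException, not NumberFormatException.
            return false;
        }
        try {
            Double.parseDouble(toValidate);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"3.14", "1e3", "1234", "Ishmael", ""};
        for (String s : samples) {
            System.out.println("\"" + s + "\" valid double: " + isValidDouble(s));
        }
    }
}
```

Note that parseDouble is fairly permissive. It accepts scientific notation such as 1e3 and ignores leading and trailing whitespace, which may or may not be desirable for a given dataset.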

The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date, for example, it is insufficient to simply verify that it is made up of integers. We may need to allow hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method reports the date as invalid. If, however, the String can be parsed to a date, we format the resulting Date object with the same pattern and compare the result with the original String. The date is considered valid only if the two match exactly:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
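
One way to loosen the restriction without giving up validation is to accept a small list of formats. The following sketch swaps in the Java 8 java.time API instead of SimpleDateFormat and uses strict resolution so that impossible dates such as February 30 are rejected. The class name DateValidationSketch and the method isValidDate are hypothetical, and the patterns use the letter u for the year because strict resolution would otherwise also require an era:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class DateValidationSketch {

    // Returns true if the text matches at least one of the accepted patterns.
    static boolean isValidDate(String theDate, String... acceptedPatterns) {
        for (String pattern : acceptedPatterns) {
            DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern)
                    .withResolverStyle(ResolverStyle.STRICT);
            try {
                LocalDate.parse(theDate, formatter);
                return true;
            } catch (DateTimeParseException e) {
                // This pattern did not match; fall through and try the next one.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String fourDigitYear = "MM/dd/uuuu";
        String twoDigitYear = "MM/dd/uu";
        System.out.println(isValidDate("12/12/1982", fourDigitYear));
        System.out.println(isValidDate("12/12/82", fourDigitYear));
        System.out.println(isValidDate("12/12/82", fourDigitYear, twoDigitYear));
        System.out.println(isValidDate("02/30/1982", fourDigitYear));
        System.out.println(isValidDate("Ishmael", fourDigitYear, twoDigitYear));
    }
}
```

Accepting a two-digit year introduces its own ambiguity, since the formatter has to assume a century. Whether that trade-off is acceptable depends on the data being validated.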

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      // Keep the compiled pattern so it can be used to create the matcher 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address

The JavaMail API also provides a way to validate e-mail addresses. In this example, we use its InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted according to country-specific or local requirements. For this reason, regular expressions are useful because they can be written to accommodate whatever postal code format is required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
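
To handle more than one country, the same approach can be extended with one pattern per country. The sketch below is illustrative only: the class name PostalCodeSketch, the country keys, and the Canadian and German patterns are simplified assumptions made for this example and do not capture every rule of the real postal systems:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PostalCodeSketch {

    // One precompiled pattern per country code (keys are an assumption of this sketch).
    private static final Map<String, Pattern> PATTERNS = new HashMap<>();

    static {
        // United States: five digits, optionally followed by a hyphen and four digits.
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        // Canada (simplified): letter-digit-letter, optional space, digit-letter-digit.
        PATTERNS.put("CA", Pattern.compile("^[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]$"));
        // Germany (simplified): exactly five digits.
        PATTERNS.put("DE", Pattern.compile("^[0-9]{5}$"));
    }

    // Returns false for unknown countries rather than guessing a format.
    static boolean isValidPostalCode(String country, String code) {
        Pattern pattern = PATTERNS.get(country);
        return pattern != null && code != null && pattern.matcher(code).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPostalCode("US", "12345-6789"));
        System.out.println(isValidPostalCode("US", "123"));
        System.out.println(isValidPostalCode("CA", "K1A 0B1"));
        System.out.println(isValidPostalCode("FR", "75001"));
    }
}
```

Compiling each pattern once and storing it in a map avoids recompiling the regular expression on every call, which matters when validating large datasets.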

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use a Unicode property in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s',.- in the character class to allow whitespace, apostrophes, commas, periods, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s',.-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analyzed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is being conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, it lowers the calculated average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
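
The mean-substitution approach described above can be sketched as follows. This is a minimal illustration that assumes missing readings are marked with a sentinel value (zero here, as in the earlier example); the class name MeanImputationSketch and its methods are hypothetical names introduced for this sketch:

```java
import java.util.Arrays;

class MeanImputationSketch {

    // Replaces every occurrence of missingMarker with the mean of the other values.
    static double[] imputeWithMean(double[] values, double missingMarker) {
        double sum = 0;
        int count = 0;
        for (double v : values) {
            if (v != missingMarker) {
                sum += v;
                count++;
            }
        }
        double[] result = Arrays.copyOf(values, values.length);
        if (count == 0) {
            return result; // nothing to average, so leave the data untouched
        }
        double mean = sum / count;
        for (int i = 0; i < result.length; i++) {
            if (result[i] == missingMarker) {
                result[i] = mean;
            }
        }
        return result;
    }

    static double average(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // The first reading is missing and marked with zero (an assumption of this sketch).
        double[] tempList = {0, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList, 0);
        System.out.println("Imputed data: " + Arrays.toString(filled));
        System.out.println("Average after imputation: " + average(filled));
    }
}
```

With this data, the eleven known readings sum to 794, so the missing month is filled with 794/11, or roughly 72.18, and the overall average becomes the same value. Using a sentinel only works when the marker can never be a legitimate reading, which is exactly the concern raised by the Antarctica example.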

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also declared an Optional<String> variable called tempName. We will use this to handle values in the array that may be null. We then loop through our array and call the Optional class's ofNullable method on each element. This method wraps the value in an Optional, which is empty if the value is null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
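
A few of these other methods are shown in the sketch below, which chains map and filter before falling back to a default. The class name OptionalSketch and the method displayName are hypothetical; the Optional methods themselves (ofNullable, map, filter, orElse, and isPresent) are part of the Java 8 API:

```java
import java.util.Optional;

class OptionalSketch {

    // Trims and upper-cases a name, treating null and blank values as missing.
    static String displayName(String name) {
        return Optional.ofNullable(name)
                .map(String::trim)               // skipped when the Optional is empty
                .filter(s -> !s.isEmpty())       // a blank name becomes an empty Optional
                .map(String::toUpperCase)
                .orElse("DEFAULT");
    }

    // Counts the entries that are not null.
    static int countPresent(String[] names) {
        int count = 0;
        for (String name : names) {
            if (Optional.ofNullable(name).isPresent()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", " Bob ", "   ", null, "Betsy"};
        for (String name : nameList) {
            System.out.println("Name to use = " + displayName(name));
        }
        System.out.println("Non-null entries: " + countPresent(nameList));
    }
}
```

Because map and filter do nothing when the Optional is empty, the chain needs no explicit null checks, and the default is supplied in a single place.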

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet variable to hold the subset retrieved from the set. Next, we print out our original set along with its last element:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new 
ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is inclusive while the second is exclusive. In the example that follows, we retrieve all numbers from the lowest number in our set, 12, up to but not including 46. We also print the size of the subset:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());       

Our output follows. The trailing values are the last element of the full set and the size of the subset:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4
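
The inclusive and exclusive bounds can be explored further with the related headSet and tailSet methods, and with the four-argument subSet method of the NavigableSet interface, which lets us choose whether each bound is included. The following is a small sketch; the class name SubsetViewsSketch and its helper methods are hypothetical:

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class SubsetViewsSketch {

    static NavigableSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
    }

    // Both bounds are included, unlike the two-argument subSet method.
    static NavigableSet<Integer> inclusiveRange(NavigableSet<Integer> set, int from, int to) {
        return set.subSet(from, true, to, true);
    }

    public static void main(String[] args) {
        NavigableSet<Integer> fullNumsList = sample();
        // headSet: everything strictly below 46
        System.out.println("headSet(46): " + fullNumsList.headSet(46));
        // tailSet: 46 and everything above it
        System.out.println("tailSet(46): " + fullNumsList.tailSet(46));
        // subSet with explicit flags: 12 through 46, both included
        System.out.println("subSet(12..46 inclusive): " + inclusiveRange(fullNumsList, 12, 46));
    }
}
```

Keep in mind that the sets returned by these methods are views backed by the original TreeSet, so changes made to the original set within the range are reflected in the subset, and vice versa.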

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, this time stored in a List called numsList, and specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the remaining elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(Collectors.toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]
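
The skip method is often paired with the limit method to take a slice from the middle of the data rather than everything after a certain point. The following sketch shows one way to do this; the class name StreamSliceSketch and the method slice are hypothetical names used only for this illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class StreamSliceSketch {

    // Sorts the numbers, drops the lowest toSkip values, and keeps at most toKeep values.
    static Set<Integer> slice(List<Integer> numsList, long toSkip, long toKeep) {
        return new TreeSet<Integer>(numsList)
                .stream()
                .skip(toSkip)
                .limit(toKeep)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        System.out.println("Original List: " + numsList);
        // Skip the two lowest values, then keep the next three.
        System.out.println("Slice of List: " + slice(numsList, 2, 3));
    }
}
```

If fewer elements remain after skipping than the limit allows, the stream simply returns whatever is left, so no bounds checking is required.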

At times, data will begin with blank lines or header lines that we wish to remove from our dataset before it is analyzed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
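
Header lines can be removed in the same pipeline by adding the skip method after the blank lines have been filtered out. The sketch below reads from a String through a StringReader so that it runs without a file on disk; the class name HeaderSkipSketch, the method dataLines, and the sample text are assumptions made for this illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

class HeaderSkipSketch {

    // Drops blank lines first, then skips the given number of header lines.
    static List<String> dataLines(String rawText, int headerLines) {
        try (BufferedReader br = new BufferedReader(new StringReader(rawText))) {
            return br.lines()
                    .filter(s -> !s.trim().isEmpty()) // also removes whitespace-only lines
                    .skip(headerLines)
                    .collect(Collectors.toList());
        } catch (IOException ex) {
            // Closing the reader can throw; rethrow unchecked to keep the sketch short.
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        String rawText = "\nname,age\n\nZoey,8\n   \nRoxie,10\n";
        for (String line : dataLines(rawText, 1)) {
            System.out.println(line);
        }
    }
}
```

Filtering before skipping means that leading blank lines do not count as header lines. Reversing the two calls would instead skip the first physical lines of the input, blank or not.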


Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
try{ 
      int validInt = Integer.parseInt(toValidate); 
      out.println(validInt + " is a valid integer"); 
}catch(NumberFormatException e){ 
      out.println(toValidate + " is not a valid integer"); 
 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The Apache Commons contain an IntegerValidator class with additional useful functionalities. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a ranger of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      // Keep the compiled Pattern so that we can create a Matcher from it
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address

The JavaMail API also provides a way to validate e-mail addresses. In this example, we use its InternetAddress class, found in the javax.mail.internet package, to check whether a given string is a valid e-mail address:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
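To illustrate how regular expressions can accommodate more than one country's format, here is a minimal sketch that looks up a pattern by country code. The class PostalCodeCheck, the two-letter keys, and the Canadian pattern are assumptions made for illustration; the Canadian pattern is deliberately simplified and does not exclude the letters that are not actually used in Canadian postal codes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: one precompiled pattern per country code.
class PostalCodeCheck {

    private static final Map<String, Pattern> PATTERNS = new HashMap<>();

    static {
        // US ZIP code, with or without the hyphen and last four digits
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        // Simplified Canadian format: letter-digit-letter, optional space, digit-letter-digit
        PATTERNS.put("CA", Pattern.compile("^[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]$"));
    }

    // Returns false for unknown countries as well as for codes that do not match.
    static boolean isValid(String country, String code) {
        Pattern pattern = PATTERNS.get(country);
        return pattern != null && pattern.matcher(code).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("US", "12345-6789")); // true
        System.out.println(isValid("CA", "K1A 0B1"));    // true
        System.out.println(isValid("US", "123"));        // false
    }
}
```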

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match letters from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s, a period, an apostrophe, a comma, and a hyphen in the character class to allow spaces, periods, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s.',-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.
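The behavior described here can be checked with a small self-contained sketch. It assumes a character class that lists the period explicitly alongside letters, whitespace, apostrophes, commas, and hyphens; the class name NameCheck and the boolean-returning isValidName method are hypothetical. The accented letters are written as Unicode escapes so that the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper: name validation with a precompiled Unicode-aware pattern.
class NameCheck {

    // Letters from any language, whitespace, period, apostrophe, comma, hyphen
    private static final Pattern NAME = Pattern.compile("^[\\p{L}\\s.',-]+$");

    static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidName("Bobby Smith, Jr."));      // true
        System.out.println(isValidName("Bobby Smith the 4th"));   // false: digits are not allowed
        System.out.println(isValidName("Albrecht M\u00fcller"));  // true
        System.out.println(isValidName("Fran\u00e7ois Moreau"));  // true
    }
}
```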

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet, which keeps its elements in ascending order. We then declare a SortedSet variable to hold the subset retrieved from the set. Next, we print out our original set:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<>(Arrays.asList(nums)); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()); 

The subSet method takes two parameters, which specify the range of values we want to retrieve. The first parameter is inclusive while the second is exclusive. In the example that follows, we retrieve every number from the smallest element in the set, 12, up to but not including 46:

partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()); 

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 
SubSet of List: [12, 14, 34, 44]
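The range semantics, and the option of removing a subset from the dataset, can be sketched as follows. The set returned by subSet is a view backed by the original TreeSet, so clearing the view removes those elements from the original. The class SubSetDemo and its helper methods are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical demo of range views on a TreeSet.
class SubSetDemo {

    static TreeSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
    }

    // Removes every element in [fromInclusive, toExclusive) from the set itself,
    // because the view returned by subSet writes through to the backing set.
    static void removeRange(TreeSet<Integer> set, int fromInclusive, int toExclusive) {
        set.subSet(fromInclusive, toExclusive).clear();
    }

    public static void main(String[] args) {
        TreeSet<Integer> nums = sample();
        System.out.println(nums.subSet(12, 46));             // [12, 14, 34, 44]
        System.out.println(nums.subSet(12, true, 46, true)); // [12, 14, 34, 44, 46]
        System.out.println(nums.headSet(46));                // [12, 14, 34, 44]
        System.out.println(nums.tailSet(46));                // [46, 52, 87, 123]

        removeRange(nums, 12, 46);
        System.out.println(nums);                            // [46, 52, 87, 123]
    }
}
```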

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. Here numsList is a List holding the same integers as in the previous example. We copy it into a TreeSet and specify how many elements to skip with the skip method. We also use the collect method to create a new Set to hold the remaining elements; the call to toCollection assumes a static import of Collectors.toCollection:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]
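The skip method pairs naturally with limit when only a window of the sorted values is wanted. The following sketch assumes a hypothetical helper named slice; it sorts the stream explicitly so that the result does not depend on the order of the input collection.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical demo: take a window of sorted values with skip and limit.
class SkipLimitDemo {

    // Sorts the values, drops the first 'skip' of them, and keeps at most 'limit' of the rest.
    static TreeSet<Integer> slice(Collection<Integer> values, long skip, long limit) {
        return values.stream()
                .sorted()
                .skip(skip)
                .limit(limit)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        System.out.println(slice(numsList, 5, 2)); // [52, 87]
        System.out.println(slice(numsList, 0, 3)); // [12, 14, 34]
    }
}
```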

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
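The same idea extends to header lines. The sketch below is one possible approach under stated assumptions: the helper name dataLines is hypothetical, whitespace-only lines are treated as blank, and header lines are assumed to be the first non-blank lines. It reads from any Reader, so it can be tried with a StringReader instead of a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo: drop blank lines first, then skip a fixed number of header lines.
class LineFilterDemo {

    static List<String> dataLines(Reader source, int headerLines) {
        try (BufferedReader br = new BufferedReader(source)) {
            return br.lines()
                    .filter(s -> !s.trim().isEmpty()) // remove empty and whitespace-only lines
                    .skip(headerLines)                // then remove the header line(s)
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\nname,age\n\nZoey,8\n   \nRoxie,10\n";
        System.out.println(dataLines(new StringReader(text), 1)); // [Zoey,8, Roxie,10]
    }
}
```

Filtering before skipping means that blank lines preceding the header do not use up the skip count.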

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now call the sort method of the Collections class, passing it our list of integers, numsList, and our comparator:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
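One detail worth keeping in mind when sorting text during cleaning is that compareTo is case-sensitive, so uppercase letters sort before lowercase ones. The sketch below contrasts natural ordering with String.CASE_INSENSITIVE_ORDER; the class CaseSortDemo, the sortedCopy helper, and the mixed-case word list are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo: case-sensitive versus case-insensitive ordering of words.
class CaseSortDemo {

    // Sorts a copy so that the caller's list is left untouched.
    static List<String> sortedCopy(List<String> words, Comparator<String> order) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("dog", "Cat", "boat", "Zoo");
        // Natural order uses compareTo, which places uppercase letters first
        System.out.println(sortedCopy(words, Comparator.naturalOrder()));     // [Cat, Zoo, boat, dog]
        // The predefined comparator ignores case differences
        System.out.println(sortedCopy(words, String.CASE_INSENSITIVE_ORDER)); // [boat, Cat, dog, Zoo]
    }
}
```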

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List instances and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference, which can be used wherever a lambda expression is expected:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
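For types that already have a natural ordering, such as Integer, the descending comparator can also be obtained directly from Comparator.reverseOrder. The following is a brief sketch of that alternative; the class DescendingDemo and the descending helper are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo: descending sort without building the comparator by hand.
class DescendingDemo {

    // Returns a new list sorted from largest to smallest.
    static List<Integer> descending(List<Integer> values) {
        List<Integer> copy = new ArrayList<>(values);
        copy.sort(Comparator.reverseOrder());
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        System.out.println(descending(numsList)); // [123, 87, 52, 46, 44, 34, 14, 12]
    }
}
```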

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each dog:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
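Because the Dogs class is only assumed in the text, a complete runnable version may help. In the sketch below, the nested Dogs class is a guess at its shape (a name, an age, and two accessors), and DogSortDemo, sample, and namesSortedBy are hypothetical names. It uses comparingInt and thenComparingInt for the age so that the int values are compared without boxing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo of two-level sorting with chained comparators.
class DogSortDemo {

    // Assumed shape of the Dogs class referred to in the text.
    static final class Dogs {
        private final String name;
        private final int age;

        Dogs(String name, int age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        int getAge() { return age; }
    }

    static final Comparator<Dogs> BY_NAME_THEN_AGE =
            Comparator.comparing(Dogs::getName).thenComparingInt(Dogs::getAge);
    static final Comparator<Dogs> BY_AGE_THEN_NAME =
            Comparator.comparingInt(Dogs::getAge).thenComparing(Dogs::getName);

    static List<Dogs> sample() {
        return new ArrayList<>(Arrays.asList(
                new Dogs("Zoey", 8), new Dogs("Roxie", 10), new Dogs("Kylie", 7),
                new Dogs("Shorty", 14), new Dogs("Ginger", 7), new Dogs("Penny", 7)));
    }

    // Sorts a fresh copy of the sample data and returns only the names, in sorted order.
    static List<String> namesSortedBy(Comparator<Dogs> order) {
        List<Dogs> dogs = sample();
        dogs.sort(order);
        return dogs.stream().map(Dogs::getName).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(namesSortedBy(BY_NAME_THEN_AGE)); // [Ginger, Kylie, Penny, Roxie, Shorty, Zoey]
        System.out.println(namesSortedBy(BY_AGE_THEN_NAME)); // [Ginger, Kylie, Penny, Zoey, Roxie, Shorty]
    }
}
```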

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
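As a sketch of how the same technique carries over to floating point data, the following variant uses Double.parseDouble. The class NumberCheck and the method isValidDouble are hypothetical names, and the explicit null check is an added assumption: unlike Integer.parseInt, Double.parseDouble throws a NullPointerException rather than a NumberFormatException when given null.

```java
// Hypothetical helper: the try-catch validation idea applied to double values.
class NumberCheck {

    static boolean isValidDouble(String text) {
        if (text == null) {
            return false; // parseDouble(null) would throw NullPointerException
        }
        try {
            Double.parseDouble(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDouble("12.34"));   // true
        System.out.println(isValidDouble("1234"));    // true
        System.out.println(isValidDouble("Ishmael")); // false
    }
}
```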

The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Often our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date, for example, it is insufficient to simply verify that it is made up of integers. We may need to allow hyphens and slashes, or ensure that the year is in two-digit or four-digit format.


Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zAZ\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address

There is a standard Java library for validating e-mail addresses as well. In this example, we use the InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

out.println(validateZip("12345")); 
out.println(validateZip("12345-6789")); 
out.println(validateZip("123")); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \\p{L} provides this flexibility. We also use  \\s-', to allow spaces, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s-',]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parenthesis after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now call the sort method as we did previously:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordList and numsList are both ArrayList and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. This is can be used where a lambda expression is used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dog class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the names and ages of each Dog:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dog class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dog objects sorted first by Name and then by Age:

      dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

   dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
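
The same try-catch technique carries over to floating point data. The following is a minimal sketch, assuming a hypothetical helper named isValidDouble that returns a boolean rather than printing, so that the result can be reused by other code:

```java
class DoubleValidationSketch {
    // Hypothetical helper: returns true when Double.parseDouble accepts the text.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false; // parseDouble would throw a NullPointerException
        }
        try {
            Double.parseDouble(toValidate);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"3.14", "1e3", "Ishmael", ""}) {
            System.out.println("\"" + s + "\" -> " + isValidDouble(s));
        }
    }
}
```

Note that Double.parseDouble is fairly permissive: it accepts scientific notation such as 1e3 as well as the strings NaN and Infinity, so an additional check may be needed if those values should be rejected.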

The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal. This version returns a String rather than printing the result itself:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
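
To illustrate the idea of range checking without any external dependency, here is a plain-Java sketch. It is an analogue of the behavior described above, not the IntegerValidator API; the parseInRange name and its parameters are assumptions made for this example.

```java
import java.util.OptionalInt;

class IntRangeSketch {
    // Hypothetical helper: parses text as an int and keeps it only
    // when min <= value <= max; otherwise returns an empty OptionalInt.
    static OptionalInt parseInRange(String text, int min, int max) {
        if (text == null) {
            return OptionalInt.empty();
        }
        try {
            int value = Integer.parseInt(text.trim());
            return (value >= min && value <= max)
                ? OptionalInt.of(value)
                : OptionalInt.empty();
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // not an integer at all
        }
    }

    public static void main(String[] args) {
        System.out.println(parseInRange("42", 0, 100));
        System.out.println(parseInRange("142", 0, 100));
        System.out.println(parseInRange("Ishmael", 0, 100));
    }
}
```

Returning an OptionalInt lets the caller distinguish a usable value from a rejected one without relying on a sentinel such as -1.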

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date, for example, it is insufficient to simply verify that it is made up of integers. We may need to allow for hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method reports the date as invalid. If, however, the String can be parsed to a date, we format the resulting Date object back into a String and compare it with the original input. The date is considered valid only if the two match:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
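
On Java 8 and later, the java.time API offers another way to apply a strict format. The sketch below swaps in DateTimeFormatter with ResolverStyle.STRICT in place of SimpleDateFormat; the isValidDate name is an assumption for this example. Note that the pattern uses uuuu rather than yyyy, because strict resolution of yyyy (year-of-era) also requires an era field.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class StrictDateSketch {
    // Hypothetical helper: true when theDate matches dateFormat exactly
    // and names a real calendar date.
    static boolean isValidDate(String theDate, String dateFormat) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(dateFormat)
            .withResolverStyle(ResolverStyle.STRICT);
        try {
            LocalDate.parse(theDate, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String dateFormat = "MM/dd/uuuu";
        for (String s : new String[] {"12/12/1982", "12/12/82", "02/30/2020", "Ishmael"}) {
            System.out.println(s + " -> " + isValidDate(s, dateFormat));
        }
    }
}
```

With strict resolution, a date such as 02/30/2020 is rejected outright instead of being adjusted to a nearby valid date, so no round-trip comparison is needed.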

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address
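
When many addresses are validated, the pattern can be compiled once and reused rather than recompiled on every call. The following is a minimal sketch of that idea. The isValidEmail name and the sample addresses are placeholders chosen for illustration, and the regular expression assumes the character ranges a-z and A-Z in the domain part.

```java
import java.util.regex.Pattern;

class EmailRegexSketch {
    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern EMAIL = Pattern.compile(
        "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-]+@"
        + "((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\])"
        + "|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$");

    // Hypothetical helper: true when the whole string matches the pattern.
    static boolean isValidEmail(String email) {
        return email != null && EMAIL.matcher(email).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "user@example.com",      // ordinary domain
            "user@[192.168.0.1]",    // bracketed IP literal
            "myEmail",               // no @ symbol
            "myEmail@mail"           // no period after the @ symbol
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValidEmail(s));
        }
    }
}
```

Returning a boolean instead of a message also makes the check easier to combine with other validation steps.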

The JavaMail API (the javax.mail.internet package), which is distributed separately from the core Java SE library, can validate e-mail addresses as well. In this example, we use its InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted according to the requirements of their country or region. Regular expressions are useful here because a pattern can be written for whatever postal code format is required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
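
To show how the same approach can accommodate more than one country, the following sketch keeps one precompiled pattern per country code. The class, the isValidPostalCode helper, and the country keys are assumptions for illustration. The Canadian pattern is deliberately simplified: it checks only the letter-digit-letter, digit-letter-digit shape and does not exclude letters that are unused in real Canadian postal codes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PostalCodeSketch {
    private static final Map<String, Pattern> PATTERNS = new HashMap<>();
    static {
        // United States: five digits, optionally followed by a hyphen and four digits.
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        // Canada (simplified): A1A 1A1, with an optional space or hyphen in the middle.
        PATTERNS.put("CA", Pattern.compile("^[A-Za-z][0-9][A-Za-z][ -]?[0-9][A-Za-z][0-9]$"));
    }

    // Hypothetical helper: false for unknown countries as well as for bad codes.
    static boolean isValidPostalCode(String country, String code) {
        Pattern pattern = PATTERNS.get(country);
        return pattern != null && code != null && pattern.matcher(code).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPostalCode("US", "12345-6789"));
        System.out.println(isValidPostalCode("CA", "K1A 0B1"));
        System.out.println(isValidPostalCode("US", "123"));
    }
}
```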

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also use \\s.',- to allow spaces, periods, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s.',-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.
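
One cleaning step that matters for Unicode names is normalization. An accented letter can arrive either as a single precomposed character or as a base letter followed by a combining mark, and \\p{L} matches only the former. The sketch below, which assumes a character class of letters, whitespace, periods, apostrophes, commas, and hyphens, normalizes the input to NFC with java.text.Normalizer before matching. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class NameNormalizationSketch {
    // Assumed character class: letters, whitespace, period, apostrophe, comma, hyphen.
    private static final Pattern NAME = Pattern.compile("^[\\p{L}\\s.',-]+$");

    // Matches the text exactly as given, with no cleaning.
    static boolean matchesRaw(String name) {
        return name != null && NAME.matcher(name).matches();
    }

    // Trims and composes the text (NFC) before matching.
    static boolean isValidName(String name) {
        if (name == null) {
            return false;
        }
        String composed = Normalizer.normalize(name.trim(), Normalizer.Form.NFC);
        return NAME.matcher(composed).matches();
    }

    public static void main(String[] args) {
        // "u" followed by U+0308 (combining diaeresis) instead of the single letter U+00FC.
        String decomposed = "Albrecht Mu\u0308ller";
        System.out.println("raw match:        " + matchesRaw(decomposed));
        System.out.println("normalized match: " + isValidName(decomposed));
    }
}
```

Without normalization, the combining mark is not a letter and the match fails; after NFC normalization the base letter and mark are composed into one letter that \\p{L} accepts.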

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:

 
// Requires java.text.ParseException, java.text.SimpleDateFormat, and java.util.Date
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            // parse throws ParseException if the text cannot be read as a date
            Date test = format.parse(theDate); 
            // Formatting the parsed date must reproduce the input exactly
            if(format.format(test).equals(theDate)){ 
                  return theDate + " is a valid date"; 
            }else{ 
                  return theDate + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
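
The same trade-off can be explored with the newer java.time API. The following is a minimal sketch rather than part of the original example: it assumes Java 8 or later, and the class name StrictDateCheck and the method isValidDate are hypothetical. It uses ResolverStyle.STRICT, which also rejects impossible calendar dates, and the pattern letter u for the year, because strict resolution of y (year-of-era) would additionally require an era:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class StrictDateCheck {

    // Returns true only if the text matches the pattern and names a real calendar date.
    static boolean isValidDate(String text, String pattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern)
                .withResolverStyle(ResolverStyle.STRICT);
        try {
            LocalDate.parse(text, formatter);
            return true;
        } catch (DateTimeParseException e) {
            // Thrown for text that does not fit the pattern and for impossible dates
            return false;
        }
    }

    public static void main(String[] args) {
        String dateFormat = "MM/dd/uuuu";
        System.out.println(isValidDate("12/12/1982", dateFormat)); // true
        System.out.println(isValidDate("12/12/82", dateFormat));   // false: year needs four digits
        System.out.println(isValidDate("02/30/1982", dateFormat)); // false: no February 30
        System.out.println(isValidDate("Ishmael", dateFormat));    // false
    }
}
```

Whether rejecting a date such as 12/12/82 is desirable still depends on the data; the sketch only makes the restriction explicit.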

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations.

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

// Requires java.util.regex.Matcher and java.util.regex.Pattern
public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      // Keep a reference to the compiled pattern so that it can create the matcher
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("[email protected]")); 
out.println(validateEmail("[email protected]")); 
out.println(validateEmail("myEmail")); 

The output follows:

[email protected] is a valid email address
[email protected] is a valid email address
myEmail is not a valid email address
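
When many addresses have to be checked, it can help to compile the pattern once and reuse it. The following is a minimal sketch under stated assumptions: the class name EmailRegexSketch and the method isValidEmail are hypothetical, the regular expression is the one described in this section with the domain character class written as a-zA-Z, and the addresses in main are made-up placeholders:

```java
import java.util.regex.Pattern;

class EmailRegexSketch {

    // Compiled once and reused; Pattern instances are immutable and safe to share.
    private static final Pattern EMAIL = Pattern.compile(
            "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-]+@"
            + "((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\])"
            + "|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$");

    // Returns true if the whole string matches the pattern.
    static boolean isValidEmail(String email) {
        return email != null && EMAIL.matcher(email).matches();
    }

    public static void main(String[] args) {
        // Hypothetical addresses used only for illustration
        System.out.println(isValidEmail("user@example.com"));   // true
        System.out.println(isValidEmail("user@[192.168.0.1]")); // true: bracketed IP literal
        System.out.println(isValidEmail("myEmail"));            // false: no @ symbol
        System.out.println(isValidEmail("myEmail@mail"));       // false: no period after @
    }
}
```

Returning a boolean instead of a message keeps the check reusable; the caller can decide how to report the result.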

The JavaMail API, a separate library from the Java SE core classes, can also validate e-mail addresses. In this example, we use its javax.mail.internet.InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the EmailValidator class from Apache Commons Validator (org.apache.commons.validator.routines.EmailValidator). This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
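
Because formats differ by country, one possible arrangement is to keep a pattern per country code. The following is a minimal sketch under stated assumptions: the class name PostalCodeSketch, the method isValidPostalCode, and the country keys are hypothetical, and the Canadian pattern is a simplified shape check (letter, digit, letter, optional space or hyphen, digit, letter, digit) rather than a complete rule:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PostalCodeSketch {

    // One compiled pattern per country code
    private static final Map<String, Pattern> PATTERNS = new HashMap<>();
    static {
        // United States: five digits, optionally followed by a hyphen and four digits
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        // Canada (simplified assumption): checks only the letter and digit positions
        PATTERNS.put("CA", Pattern.compile("^[A-Za-z][0-9][A-Za-z][ -]?[0-9][A-Za-z][0-9]$"));
    }

    // Returns false for unknown countries instead of guessing a format.
    static boolean isValidPostalCode(String countryCode, String code) {
        Pattern pattern = PATTERNS.get(countryCode);
        return pattern != null && code != null && pattern.matcher(code).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPostalCode("US", "12345-6789")); // true
        System.out.println(isValidPostalCode("US", "123"));        // false
        System.out.println(isValidPostalCode("CA", "K1A 0B1"));    // true
    }
}
```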

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match letters from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s, a period, an apostrophe, a comma, and a hyphen in the character class to allow whitespace, periods, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s.',-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.
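
Cleaning the text before matching can be sketched as follows. This is an illustration under stated assumptions, not the original method: the class name NameCleaningSketch and the methods clean and isValidName are hypothetical. Because whitespace is trimmed and collapsed first, the character class only needs a plain space, and it lists the period explicitly alongside the apostrophe, comma, and hyphen:

```java
import java.util.regex.Pattern;

class NameCleaningSketch {

    // Letters from any script plus space, period, apostrophe, comma, and hyphen
    private static final Pattern NAME = Pattern.compile("^[\\p{L} .',-]+$");

    // Trims the ends and collapses any run of whitespace into a single space.
    static String clean(String raw) {
        return raw.trim().replaceAll("\\s+", " ");
    }

    // Cleans the text first, then matches the whole result against the pattern.
    static boolean isValidName(String raw) {
        return NAME.matcher(clean(raw)).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidName("  Bobby \t Smith,  Jr. ")); // true
        System.out.println(isValidName("Bobby Smith the 4th"));     // false: contains a digit
        System.out.println(isValidName("Fran\u00e7ois Moreau"));     // true
    }
}
```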

Cleaning images

While image processing is a complex task, we will introduce a few techniques to clean and extract information from an image. This will provide the reader with some insight into image processing. We will also demonstrate how to extract text data from an image using Optical Character Recognition (OCR).

There are several techniques used to improve the quality of an image. Many of these require tweaking of parameters to get the improvement desired. We will demonstrate how to:

  • Enhance an image's contrast
  • Smooth an image
  • Brighten an image
  • Resize an image
  • Convert images to different formats

We will use OpenCV (http://opencv.org/), an open source project for image processing. There are several classes that we will use:

  • Mat: This represents an n-dimensional array holding image data such as channel, grayscale, or color values
  • Imgproc: Possesses many methods that process an image
  • Imgcodecs: Possesses methods to read and write image files

The OpenCV Javadoc is available at http://docs.opencv.org/java/2.4.9/. In the examples that follow, we will use Wikipedia images since they can be freely downloaded. Specifically, we will use images of a parrot and a cat.

Changing the contrast of an image

Here we will demonstrate how to enhance a black-and-white image of a parrot. The Imgcodecs class's imread method reads in the image. Its second parameter specifies the type of color used by the image, which is grayscale in this case. A new Mat object is created for the enhanced image using the same size and color type as the original.

The actual work is performed by the equalizeHist method. This equalizes the histogram of the image, which normalizes the brightness and increases the contrast of the image. An image histogram represents the tonal distribution of an image, that is, how many pixels fall at each level of lightness. It shows the variation in the brightness found in an image.

The last step is to write out the image.

Mat source = Imgcodecs.imread("GrayScaleParrot.png", 
        Imgcodecs.CV_LOAD_IMAGE_GRAYSCALE); 
Mat destination = new Mat(source.rows(), source.cols(), source.type()); 
Imgproc.equalizeHist(source, destination); 
Imgcodecs.imwrite("enhancedParrot.jpg", destination); 

The following is the original image:

[Image: the original grayscale parrot]

The enhanced image follows:

[Image: the parrot after histogram equalization]
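
To make the idea of histogram equalization more concrete, the following plain Java sketch applies the textbook formula to an array of 8-bit grayscale values. It is an illustration only and is not OpenCV's implementation; the class name HistogramEqualizationSketch and the method equalize are hypothetical, and the pixel values are assumed to lie between 0 and 255:

```java
class HistogramEqualizationSketch {

    // Spreads the gray levels that are in use across the full 0 to 255 range.
    static int[] equalize(int[] pixels) {
        // Count how many pixels have each gray level
        int[] histogram = new int[256];
        for (int p : pixels) {
            histogram[p]++;
        }
        // Cumulative distribution: number of pixels at or below each level
        int[] cdf = new int[256];
        int running = 0;
        for (int level = 0; level < 256; level++) {
            running += histogram[level];
            cdf[level] = running;
        }
        // Smallest non-zero cumulative count, which belongs to the darkest level present
        int cdfMin = 0;
        for (int level = 0; level < 256; level++) {
            if (cdf[level] > 0) {
                cdfMin = cdf[level];
                break;
            }
        }
        int total = pixels.length;
        if (total == cdfMin) {
            return pixels.clone(); // only one gray level present, nothing to spread
        }
        int[] result = new int[total];
        for (int i = 0; i < total; i++) {
            result[i] = Math.round(255f * (cdf[pixels[i]] - cdfMin) / (total - cdfMin));
        }
        return result;
    }

    public static void main(String[] args) {
        // Four close gray levels are stretched to 0, 85, 170, and 255
        int[] stretched = equalize(new int[] {50, 60, 70, 80});
        System.out.println(java.util.Arrays.toString(stretched)); // [0, 85, 170, 255]
    }
}
```

The input values span only 30 gray levels, while the output spans the whole range, which is the increase in contrast described above.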

Smoothing an image

Smoothing an image, also called blurring, will make the edges of an image smoother. Blurring is the process of making an image less distinct. We recognize blurred objects when we take a picture with the camera out of focus. Blurring can be used for special effects. Here, we will use it to create an image that we will then sharpen.

The following example loads an image of a cat and repeatedly applies the blur method to the image. In this example, the process is repeated 25 times. Increasing the number of iterations will result in more blur or smoothing.

The third argument of the blur method is the blurring kernel size. The kernel is a matrix, 3 by 3 in this example, that is used for convolution. Convolution replaces each pixel with a weighted sum of that pixel and its neighbors. For the blur method the weights are equal, so each pixel becomes the average of its neighborhood. This allows neighboring values to affect an element's value:

Mat source = Imgcodecs.imread("cat.jpg"); 
Mat destination = source.clone(); 
for (int i = 0; i < 25; i++) { 
    Mat sourceImage = destination.clone(); 
    Imgproc.blur(sourceImage, destination, new Size(3.0, 3.0)); 
} 
Imgcodecs.imwrite("smoothCat.jpg", destination); 

The following is the original image:

[Image: the original cat]

The enhanced image follows:

[Image: the cat after 25 passes of blurring]
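
The averaging performed by a 3 by 3 kernel can be sketched in plain Java on a small grid of grayscale values. This is an illustration under stated assumptions and not OpenCV's code: the class name BoxBlurSketch and the method boxBlur3x3 are hypothetical, and at the image border the sketch simply averages the neighbors that exist, whereas OpenCV by default extrapolates border pixels by reflection:

```java
class BoxBlurSketch {

    // Replaces every pixel with the average of itself and its in-bounds neighbors.
    static int[][] boxBlur3x3(int[][] image) {
        int rows = image.length;
        int cols = image[0].length;
        int[][] result = new int[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int sum = 0;
                int count = 0;
                // Visit the 3 by 3 neighborhood centered on (r, c)
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        int rr = r + dr;
                        int cc = c + dc;
                        if (rr >= 0 && rr < rows && cc >= 0 && cc < cols) {
                            sum += image[rr][cc];
                            count++;
                        }
                    }
                }
                result[r][c] = Math.round((float) sum / count);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] image = {
            {0, 0, 0},
            {0, 90, 0},
            {0, 0, 0}
        };
        // The bright center is averaged with its eight dark neighbors: 90 / 9 = 10
        System.out.println(boxBlur3x3(image)[1][1]); // 10
    }
}
```

Applying the method repeatedly, as the loop in the example above does with blur, spreads each value further across the image.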

Brightening an image

The convertTo method provides a means of brightening an image. The original image is copied to a new image where the contrast and brightness are adjusted. The first parameter is the destination image. The second, -1, specifies that the type of the image should not be changed. The third and fourth parameters control the contrast and brightness respectively: each pixel value is multiplied by the third parameter, and the fourth parameter is then added to the product:

Mat source = Imgcodecs.imread("cat.jpg"); 
Mat destination = new Mat(source.rows(), source.cols(), 
        source.type()); 
source.convertTo(destination, -1, 1, 50); 
Imgcodecs.imwrite("brighterCat.jpg", destination); 

The enhanced image follows:

[Image: the brightened cat]
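
The arithmetic applied to each 8-bit value can be sketched in plain Java. This is an illustration, not OpenCV's code: the class name BrightnessSketch and the method adjust are hypothetical, and the parameter names alpha and beta stand for the multiplier and the added offset. The result is clamped to the range 0 to 255, so very bright pixels saturate at white instead of wrapping around:

```java
class BrightnessSketch {

    // Computes alpha * value + beta and clamps the result to the 8-bit range.
    static int adjust(int value, double alpha, double beta) {
        long scaled = Math.round(alpha * value + beta);
        return (int) Math.max(0, Math.min(255, scaled));
    }

    public static void main(String[] args) {
        System.out.println(adjust(100, 1.0, 50)); // 150: brightened by 50
        System.out.println(adjust(230, 1.0, 50)); // 255: clamped at white
        System.out.println(adjust(100, 1.5, 0));  // 150: contrast scaled by 1.5
    }
}
```

With a multiplier of 1 and an offset of 50, as in the example above, every value is shifted up by 50 and the contrast is left unchanged.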

Resizing an image

Sometimes it is desirable to resize an image. The resize method shown next illustrates how this is done. The image is read in and a new Mat object is created. The resize method is then applied where the width and height are specified in the Size object parameter. The resized image is then saved:

Mat source = Imgcodecs.imread("cat.jpg"); 
Mat resizeimage = new Mat(); 
Imgproc.resize(source, resizeimage, new Size(250, 250)); 
Imgcodecs.imwrite("resizedCat.jpg", resizeimage); 

The resized image follows:

[Image: the cat resized to 250 by 250 pixels]
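
Resizing to a fixed 250 by 250 size stretches or squeezes any image that is not square. One way to avoid this is to compute the height from the target width. The following is a minimal sketch of that arithmetic; the class name AspectRatioSketch and the method scaleToWidth are hypothetical:

```java
class AspectRatioSketch {

    // Returns {width, height} for the target width, preserving the aspect ratio.
    static int[] scaleToWidth(int width, int height, int targetWidth) {
        if (width <= 0 || height <= 0 || targetWidth <= 0) {
            throw new IllegalArgumentException("Dimensions must be positive");
        }
        // Scale the height by the same factor as the width, keeping at least one pixel
        int targetHeight = (int) Math.max(1, Math.round((double) height * targetWidth / width));
        return new int[] {targetWidth, targetHeight};
    }

    public static void main(String[] args) {
        int[] dims = scaleToWidth(1000, 500, 250);
        System.out.println(dims[0] + " x " + dims[1]); // 250 x 125
    }
}
```

The two values could then be passed to the Size constructor in place of the fixed 250 and 250.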

Converting images to different formats

Another common operation is to convert an image that uses one format into an image that uses a different format. In OpenCV, this is easy to accomplish as shown next. The image is read in and then immediately written out. The extension of the file is used by the imwrite method to convert the image to the new format:

Mat source = Imgcodecs.imread("cat.jpg"); 
Imgcodecs.imwrite("convertedCat.jpg", source); 
Imgcodecs.imwrite("convertedCat.jpeg", source); 
Imgcodecs.imwrite("convertedCat.webp", source); 
Imgcodecs.imwrite("convertedCat.png", source); 
Imgcodecs.imwrite("convertedCat.tiff", source); 

The images can now be used for specialized processing if necessary.
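
Choosing the output format from the file extension can also be sketched with the standard javax.imageio classes. This is a plain Java analogue under stated assumptions, not OpenCV code: the class name FormatConversionSketch and its methods are hypothetical, and the formats that can actually be written depend on the ImageIO plugins available in the Java runtime:

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Locale;
import javax.imageio.ImageIO;

class FormatConversionSketch {

    static {
        // Keep intermediate data in memory instead of temporary cache files
        ImageIO.setUseCache(false);
    }

    // Extracts the lower-case extension, for example "png" from "convertedCat.png".
    static String formatFromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            throw new IllegalArgumentException("No file extension in " + fileName);
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Encodes the image in the format named by the file extension.
    static byte[] encode(BufferedImage image, String fileName) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // ImageIO.write returns false when no writer supports the format
            if (!ImageIO.write(image, formatFromFileName(fileName), bytes)) {
                throw new IllegalArgumentException("No writer available for " + fileName);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes image bytes; the format is detected from the data itself.
    static BufferedImage decode(byte[] data) {
        try {
            return ImageIO.read(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        byte[] png = encode(image, "convertedCat.png");
        System.out.println(decode(png).getWidth()); // 2
    }
}
```

Checking the boolean returned by ImageIO.write matters because an unsupported format is reported through that value rather than through an exception.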

Changing the contrast of an image

Here we will demonstrate how to enhance a black-and-white image of a parrot. The Imgcodecs class's imread method reads in the image. Its second parameter specifies the type of color used by the image, which is grayscale in this case. A new Mat object is created for the enhanced image using the same size and color type as the original.

The actual work is performed by the equalizeHist method. This equalizes the histogram of the image which has the effect of normalizing the brightness and increases the contrast of the image. An image histogram is a histogram representing the tonal distribution of an image. Tonal is also referred to as lightness. It represents the variation in the brightness found in an image.

The last step is to write out the image.

Mat source = Imgcodecs.imread("GrayScaleParrot.png", 
        Imgcodecs.CV_LOAD_IMAGE_GRAYSCALE); 
Mat destination = new Mat(source.rows(), source.cols(), source.type()); 
Imgproc.equalizeHist(source, destination); 
Imgcodecs.imwrite("enhancedParrot.jpg", destination); 

The following is the original image:

Changing the contrast of an image

The enhanced image follows:

Changing the contrast of an image

Smoothing an image

Smoothing an image, also called blurring, will make the edges of an image smoother. Blurring is the process of making an image less distinct. We recognize blurred objects when we take a picture with the camera out of focus. Blurring can be used for special effects. Here, we will use it to create an image that we will then sharpen.

The following example loads an image of a cat and repeatedly applies the blur method to the image. In this example, the process is repeated 25 times. Increasing the number of iterations will result in more blur or smoothing.

The third argument of the blur method is the blurring kernel size. The kernel is a matrix of pixels, 3 by 3 in this example, that is used for convolution. This is the process of multiplying each element of an image by weighted values of its neighbors. This allows neighboring values to effect an element's value:

Mat source = Imgcodecs.imread("cat.jpg"); 
Mat destination = source.clone(); 
for (int i = 0; i < 25; i++) { 
    Mat sourceImage = destination.clone(); 
    Imgproc.blur(sourceImage, destination, new Size(3.0, 3.0)); 
} 
Imgcodecs.imwrite("smoothCat.jpg", destination); 

The following is the original image:

Smoothing an image

The enhanced image follows:

Smoothing an image

Brightening an image

The convertTo method provides a means of brightening an image. The original image is copied to a new image where the contrast and brightness is adjusted. The first parameter is the destination image. The second specifies that the type of image should not be changed. The third and fourth parameters control the contrast and brightness respectively. The first value is multiplied by this value while the second is added to the multiplied value:

Mat source = Imgcodecs.imread("cat.jpg"); 
Mat destination = new Mat(source.rows(), source.cols(), 
        source.type()); 
source.convertTo(destination, -1, 1, 50); 
Imgcodecs.imwrite("brighterCat.jpg", destination); 

The enhanced image follows:

Brightening an image

Resizing an image

Sometimes it is desirable to resize an image. The resize method shown next illustrates how this is done. The image is read in and a new Mat object is created. The resize method is then applied where the width and height are specified in the Size object parameter. The resized image is then saved:

Mat source = Imgcodecs.imread("cat.jpg"); 
Mat resizeimage = new Mat(); 
Imgproc.resize(source, resizeimage, new Size(250, 250)); 
Imgcodecs.imwrite("resizedCat.jpg", resizeimage); 

The enhanced image follows:

Resizing an image

Converting images to different formats

Another common operation is to convert an image that uses one format into an image that uses a different format. In OpenCV, this is easy to accomplish as shown next. The image is read in and then immediately written out. The extension of the file is used by the imwrite method to convert the image to the new format:

Mat source = Imgcodecs.imread("cat.jpg"); 
Imgcodecs.imwrite("convertedCat.jpg", source); 
Imgcodecs.imwrite("convertedCat.jpeg", source); 
Imgcodecs.imwrite("convertedCat.webp", source); 
Imgcodecs.imwrite("convertedCat.png", source); 
Imgcodecs.imwrite("convertedCat.tiff", source); 

The images can now be used for specialized processing if necessary.

Smoothing an image

Smoothing an image, also called blurring, will make the edges of an image smoother. Blurring is the process of making an image less distinct. We recognize blurred objects when we take a picture with the camera out of focus. Blurring can be used for special effects. Here, we will use it to create an image that we will then sharpen.

The following example loads an image of a cat and repeatedly applies the blur method to the image. In this example, the process is repeated 25 times. Increasing the number of iterations will result in more blur or smoothing.

The third argument of the blur method is the blurring kernel size. The kernel is a matrix of pixels, 3 by 3 in this example, that is used for convolution. This is the process of multiplying each element of an image by weighted values of its neighbors. This allows neighboring values to effect an element's value:

Mat source = Imgcodecs.imread("cat.jpg"); 
Mat destination = source.clone(); 
for (int i = 0; i < 25; i++) { 
    Mat sourceImage = destination.clone(); 
    Imgproc.blur(sourceImage, destination, new Size(3.0, 3.0)); 
} 
Imgcodecs.imwrite("smoothCat.jpg", destination); 

The following is the original image:

Smoothing an image

The enhanced image follows:

Smoothing an image

Brightening an image

The convertTo method provides a means of brightening an image. The original image is copied to a new image where the contrast and brightness is adjusted. The first parameter is the destination image. The second specifies that the type of image should not be changed. The third and fourth parameters control the contrast and brightness respectively. The first value is multiplied by this value while the second is added to the multiplied value:

Mat source = Imgcodecs.imread("cat.jpg"); 
Mat destination = new Mat(source.rows(), source.cols(), 
        source.type()); 
source.convertTo(destination, -1, 1, 50); 
Imgcodecs.imwrite("brighterCat.jpg", destination); 

The enhanced image follows:

Brightening an image

Resizing an image

Sometimes it is desirable to resize an image. The resize method shown next illustrates how this is done. The image is read in and a new Mat object is created. The resize method is then applied where the width and height are specified in the Size object parameter. The resized image is then saved:

Mat source = Imgcodecs.imread("cat.jpg"); 
Mat resizeimage = new Mat(); 
Imgproc.resize(source, resizeimage, new Size(250, 250)); 
Imgcodecs.imwrite("resizedCat.jpg", resizeimage); 

The enhanced image follows:

Resizing an image

Converting images to different formats

Another common operation is to convert an image that uses one format into an image that uses a different format. In OpenCV, this is easy to accomplish as shown next. The image is read in and then immediately written out. The extension of the file is used by the imwrite method to convert the image to the new format:

Mat source = Imgcodecs.imread("cat.jpg"); 
Imgcodecs.imwrite("convertedCat.jpg", source); 
Imgcodecs.imwrite("convertedCat.jpeg", source); 
Imgcodecs.imwrite("convertedCat.webp", source); 
Imgcodecs.imwrite("convertedCat.png", source); 
Imgcodecs.imwrite("convertedCat.tiff", source); 

The images can now be used for specialized processing if necessary.

Brightening an image

The convertTo method provides a means of brightening an image. The original image is copied to a new image where the contrast and brightness is adjusted. The first parameter is the destination image. The second specifies that the type of image should not be changed. The third and fourth parameters control the contrast and brightness respectively. The first value is multiplied by this value while the second is added to the multiplied value:

Mat source = Imgcodecs.imread("cat.jpg"); 
Mat destination = new Mat(source.rows(), source.cols(), 
        source.type()); 
source.convertTo(destination, -1, 1, 50); 
Imgcodecs.imwrite("brighterCat.jpg", destination); 

The enhanced image follows:

Brightening an image

Resizing an image

Sometimes it is desirable to resize an image. The resize method shown next illustrates how this is done. The image is read in and a new Mat object is created. The resize method is then applied where the width and height are specified in the Size object parameter. The resized image is then saved:

Mat source = Imgcodecs.imread("cat.jpg"); 
Mat resizeimage = new Mat(); 
Imgproc.resize(source, resizeimage, new Size(250, 250)); 
Imgcodecs.imwrite("resizedCat.jpg", resizeimage); 

The enhanced image follows:

Resizing an image

Converting images to different formats

Another common operation is to convert an image that uses one format into an image that uses a different format. In OpenCV, this is easy to accomplish as shown next. The image is read in and then immediately written out. The extension of the file is used by the imwrite method to convert the image to the new format:

Mat source = Imgcodecs.imread("cat.jpg"); 
Imgcodecs.imwrite("convertedCat.jpg", source); 
Imgcodecs.imwrite("convertedCat.jpeg", source); 
Imgcodecs.imwrite("convertedCat.webp", source); 
Imgcodecs.imwrite("convertedCat.png", source); 
Imgcodecs.imwrite("convertedCat.tiff", source); 

The images can now be used for specialized processing if necessary.
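
The idea of choosing the output format from the file extension can also be sketched with the standard javax.imageio package instead of OpenCV. This is a substitute technique shown for illustration only: the class and method names (FormatBySuffix, extensionOf, writeByExtension) are hypothetical, and the set of formats that can be written depends on the image writers installed in the Java runtime.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Locale;
import javax.imageio.ImageIO;

// Hypothetical helper using javax.imageio, not OpenCV.
final class FormatBySuffix {

    // Returns the lower-case text after the last dot, or "" when there is none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Writes the image in the format named by the extension.
    // ImageIO.write returns false when no writer exists for that format.
    static boolean writeByExtension(BufferedImage image, File target)
            throws IOException {
        return ImageIO.write(image, extensionOf(target.getName()), target);
    }

    public static void main(String[] args) throws IOException {
        // A small blank in-memory image stands in for the cat picture.
        BufferedImage image = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        File target = File.createTempFile("convertedCat", ".png");
        target.deleteOnExit();
        boolean written = writeByExtension(image, target);
        System.out.println(extensionOf(target.getName()) + " written: " + written);
    }
}
```

Checking the boolean result is worthwhile in either library, since a write can fail for an unsupported format without raising an exception.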

Summary

Many times, half the battle in data science is manipulating data so that it is clean enough to work with. In this chapter, we examined many techniques for taking real-world, messy data and transforming it into workable datasets. This process is generally known as data cleaning, wrangling, reshaping, or munging. Our focus was on core Java techniques, but we also examined third-party libraries.

Before we can clean data, we need to have a solid understanding of the format of our data. We discussed CSV data, spreadsheets, PDF, and JSON file types, and provided several examples of manipulating text file data. As we examined text data, we looked at multiple approaches for processing the data, including tokenizers, Scanners, and BufferedReaders. We showed ways to perform simple cleaning operations, remove stop words, and perform find and replace functions.

This chapter also included a discussion on data imputation and the importance of identifying and rectifying missing data situations. Missing data can cause problems during data analysis and we proposed different methods for dealing with this problem. We demonstrated how to retrieve subsets of data and sort data as well.

Finally, we discussed image cleaning and demonstrated several methods of modifying image data. These included changing contrast, smoothing, brightening, and resizing images, as well as converting images between formats. We concluded with a discussion on extracting text embedded in an image.

With this background, we will introduce basic statistical methods and their Java support in the next chapter.