Machine Learning: End-to-End guide for Java developers

By: Boštjan Kaluža, Jennifer L. Reese, Krishna Choppella, Richard M. Reese, Uday Kamath

Overview of this book

Machine learning is one of the core areas of artificial intelligence, in which computers are trained to learn, grow, change, and develop on their own without being explicitly programmed. In this course, we cover how Java is employed to build powerful machine learning models to address the problems faced in the world of data science. The course demonstrates complex data extraction and statistical analysis techniques supported by Java, applies various machine learning methods, explores machine learning sub-domains, and works through real-world use cases such as recommendation systems, fraud detection, natural language processing, and more, using Java programming. The course begins with an introduction to data science and basic data science tasks such as data collection, data cleaning, data analysis, and data visualization. The next section gives a detailed overview of statistical techniques, covering machine learning, neural networks, and deep learning. The next couple of sections cover applying machine learning methods with Java to a variety of tasks, including classification, prediction, forecasting, market basket analysis, clustering, stream learning, active learning, semi-supervised learning, probabilistic graph modeling, text mining, and deep learning. The last section highlights real-world use cases such as performing activity recognition, developing image recognition, text classification, and anomaly detection. The course includes premium content from three of our most popular books:

  • Java for Data Science
  • Machine Learning in Java
  • Mastering Java Machine Learning

On completion of this course, you will understand various machine learning techniques, the Java machine learning algorithms you can use to gain insights from data, how to build data models to analyze large, complex data sets, and how to develop applications that apply Java and machine learning algorithms in the field of artificial intelligence.

Chapter 9. Text Analysis

Text analysis is a broad topic and is typically referred to as Natural Language Processing (NLP). It is used for many different tasks, including text searching, language translation, sentiment analysis, speech recognition, and classification, to mention a few. Analyzing text can be difficult because of the peculiarities and ambiguity found in natural languages. However, there has been a considerable amount of work in this area, and there are several Java APIs supporting this effort.

We will start with an introduction to the basic concepts and tasks used in NLP. These include the following:

  • Tokenization: The process of splitting text into individual tokens or words.
  • Stop words: These are words that are common and may not be necessary for processing. They include such words as the, a, and to.
  • Named Entity Recognition (NER): This is the process of identifying elements of text such as people's names, locations, or things.
  • Parts of Speech (POS): This identifies the grammatical parts of a sentence such as noun, verb, adjective, and so on.
  • Relationships: Here, we are concerned with identifying how parts of text are related to each other, such as the subject and object of a sentence.

The concepts of words, sentences, and paragraphs are well known. However, extracting and analyzing these components is not always that straightforward. The term corpus frequently refers to a collection of text.

As with most data science problems, it is important to preprocess text. Frequently, this involves handling tasks such as the following (a simple sketch of this kind of preprocessing appears after the list):

  • Handling Unicode
  • Converting text to uppercase or lowercase
  • Removing stop words
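
The following is a minimal, hand-rolled sketch of such preprocessing. The SimplePreprocessor class and its tiny stop word list are purely illustrative and are not part of any library used in this chapter:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class SimplePreprocessor {
    // A tiny, illustrative stop word list; a real application would use a much larger one
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "to", "and", "of"));

    public static List<String> preprocess(String text) {
        // Normalize case, split on whitespace, and drop stop words
        return Arrays.stream(text.toLowerCase(Locale.ENGLISH).split("\\s+"))
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(preprocess("The cat sat on a mat"));
        // Prints: [cat, sat, on, mat]
    }
}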

We examined several techniques for tokenization and removing stop words in Chapter 3, Data Cleaning. In this chapter, we will focus on POS, NER, extracting relationships from sentences, text classification, and sentiment analysis.

There are several NLP APIs available.

We will use OpenNLP and DL4J to demonstrate text analysis in this chapter. We chose these because they are both well-known and have good published resources for additional support.

We will use the Google Word2Vec and Doc2Vec neural networks to perform text classification. This includes feature vectors based on other words as well as using labeled information to classify documents. Finally, we will discuss sentiment analysis. This type of analysis seeks to assign meaning to text and also uses the Word2Vec network.

We start our discussion with NER.

Implementing named entity recognition

This is sometimes referred to as finding people and things. Given a text segment, we may want to identify all the names of people present. However, this is not always easy because a name such as Rob may also be used as a verb.

In this section, we will demonstrate how to use OpenNLP's TokenNameFinderModel class to find names and locations in text. While there are other entities we may want to find, this example will demonstrate the basics of the technique. We begin with names.

Most names occur within a single sentence. We do not want to process text across sentence boundaries, because an entity such as a state might inadvertently be identified when it is not actually present. Consider the following sentences:

Jim headed north. Dakota headed south.

If we ignored the period, then the state of North Dakota might be identified as a location, when in fact it is not present.

Using OpenNLP to perform NER

We start our example with a try-catch block to handle exceptions. OpenNLP uses models that have been trained on different sets of data. In this example, the en-token.bin and en-ner-person.bin files contain the models for the tokenization of English text and for English name elements, respectively. These files can be downloaded from http://opennlp.sourceforge.net/models-1.5/. However, the IO stream used here is standard Java:

try (InputStream tokenStream =  
            new FileInputStream(new File("en-token.bin")); 
        InputStream personModelStream = new FileInputStream( 
            new File("en-ner-person.bin"));) { 
    ... 
} catch (Exception ex) { 
    // Handle exceptions 
} 

An instance of the TokenizerModel class is initialized using the token stream. This instance is then used to create the actual TokenizerME tokenizer. We will use this instance to tokenize our sentence:

TokenizerModel tm = new TokenizerModel(tokenStream); 
TokenizerME tokenizer = new TokenizerME(tm); 

The TokenNameFinderModel class is used to hold a model for name entities. It is initialized using the person model stream. An instance of the NameFinderME class is created using this model since we are looking for names:

TokenNameFinderModel tnfm = new 
  TokenNameFinderModel(personModelStream); 
NameFinderME nf = new NameFinderME(tnfm); 

To demonstrate the process, we will use the following sentence. We then convert it to a series of tokens using the tokenizer's tokenize method:

String sentence = "Mrs. Wilson went to Mary's house for dinner."; 
String[] tokens = tokenizer.tokenize(sentence); 

The Span class holds information regarding the positions of entities. The find method will return the position information, as shown here:

Span[] spans = nf.find(tokens); 

This array holds information about person entities found in the sentence. We then display this information as shown here:

for (int i = 0; i < spans.length; i++) { 
    out.println(spans[i] + " - " + tokens[spans[i].getStart()]); 
} 

The output for this sequence is as follows. Notice that it identifies the last name of Mrs. Wilson but not the "Mrs.":

[1..2) person - Wilson
[4..5) person - Mary

Once these entities have been extracted, we can use them for specialized analysis.
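
As a small follow-on sketch, OpenNLP's Span class provides a spansToStrings utility that recovers the text covered by each span, and NameFinderME can report a probability for each span. The loop below is illustrative and is not part of the original example:

String[] names = Span.spansToStrings(spans, tokens); 
double[] probs = nf.probs(spans); 
for (int i = 0; i < names.length; i++) { 
    // Pair the recovered entity text with the model's confidence in it 
    out.println(names[i] + " (probability: " + probs[i] + ")"); 
} 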

Identifying location entities

We can also find other types of entities such as dates and locations. In the following example, we find locations in a sentence. It is very similar to the previous person example, except that an en-ner-location.bin file is used for the model:

try (InputStream tokenStream =  
            new FileInputStream("en-token.bin"); 
        InputStream locationModelStream = new FileInputStream( 
            new File("en-ner-location.bin"));) { 
 
    TokenizerModel tm = new TokenizerModel(tokenStream); 
    TokenizerME tokenizer = new TokenizerME(tm); 
 
    TokenNameFinderModel tnfm =  
        new TokenNameFinderModel(locationModelStream); 
    NameFinderME nf = new NameFinderME(tnfm); 
 
    sentence = "Enid is located north of Oklahoma City."; 
    String tokens[] = tokenizer.tokenize(sentence); 
 
    Span spans[] = nf.find(tokens); 
 
    for (int i = 0; i < spans.length; i++) { 
        out.println(spans[i] + " - " +  
        tokens[spans[i].getStart()]); 
    } 
} catch (Exception ex) { 
    // Handle exceptions 
} 

With the sentence defined previously, the model was only able to find the second city, as shown here. This is likely due to the confusion that arises with the name Enid, which is both the name of a city and a person's name:

[5..7) location - Oklahoma

Suppose we use the following sentence:

sentence = "Pond Creek is located north of Oklahoma City."; 

Then we get this output:


[1..2) location - Creek
[6..8) location - Oklahoma

Unfortunately, it has missed the town of Pond Creek. NER is a useful tool for many applications, but like many techniques, it is not always foolproof. The accuracy of the NER approach presented, and many of the other NLP examples, will vary depending on factors such as the accuracy of the model, the language being used, and the type of entity.

We may also be interested in how text can be classified. We will examine one approach in the next section.

Classifying text

Classifying text is an important part of machine learning and data science. We have to be able to classify text for a variety of applications, including document retrieval and web searches. It is often important to assign specific labels to the data before we can determine its usefulness for a particular application or search result.

In this chapter, we are going to demonstrate a technique involving the use of paragraph vectors and labeled data with DL4J classes. This example allows us to read in documents and, based on the text inside of the document, assign a label (or classification) to the document. We are also going to show an example of classifying text by similarity. This means we will match phrases and words that have similar structure. This example will also use DL4J.

Word2Vec and Doc2Vec

We will be using Word2Vec and Doc2Vec in a few examples in this chapter. Word2Vec is a neural network with two layers used for text processing. Given a body of text, the network will provide feature vectors for the words contained in the text. These vectors are simply mathematical representations of the word features and can be numerically compared to other vectors. This comparison is often referred to as the distance between two words.

Word2Vec operates with the understanding that words can be classified by determining the probability that two words will occur together. Because of this methodology, Word2Vec can be used for more than classification of sentences. Any object or data that can be represented by text labels can be classified with this network.
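
As a quick, hedged illustration of this idea, the following sketch builds a Word2Vec model with DL4J and asks it for the distance between two words. It reuses the raw_sentences.txt resource introduced later in this chapter, the hyperparameters are illustrative, and exception handling is omitted as in the other snippets:

SentenceIterator sentences = new BasicLineIterator( 
        new ClassPathResource("/raw_sentences.txt").getFile()); 
TokenizerFactory tokens = new DefaultTokenizerFactory(); 
tokens.setTokenPreProcessor(new CommonPreprocessor()); 
 
Word2Vec w2v = new Word2Vec.Builder() 
        .minWordFrequency(5)       // ignore very rare words 
        .layerSize(100)            // dimensionality of each feature vector 
        .windowSize(5) 
        .iterate(sentences) 
        .tokenizerFactory(tokens) 
        .build(); 
w2v.fit(); 
 
// Cosine similarity between two word vectors; values closer to 1.0 mean more similar 
out.println(w2v.similarity("house", "world")); 
out.println(w2v.wordsNearest("house", 5)); 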

Doc2Vec is an extension of Word2Vec. Rather than building vectors representing the features of individual words compared to other words, as Word2Vec does, this network compares words to given labels. The vectors are set up to represent the theme or overall meaning of a document. Our next example shows how these feature vectors are then associated with specific documents.

Classifying text by labels

In our first example using Doc2Vec, we will associate our documents with three labels: health, finance, and science. But before we can associate the data with labels, we have to define those labels and train our model to recognize the labels. Each label represents the meaning or classification of a particular piece of text.

In this example we will use sample documents, each pre-labelled with our categories: health, finance, or science. We will use these paragraphs to train our model and then, as in previous examples, use a set of test data to test our model. We will be using the files found at https://github.com/deeplearning4j/dl4j-examples/tree/master/dl4j-examples/src/main/resources/paravec. We have based this example upon sample code written for DL4J, which can be found at https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/nlp/paragraphvectors/ParagraphVectorsClassifierExample.java.

First we need to set up some instance variables to use later in our code. We will be using a ParagraphVectors object to create our vectors, a LabelAwareIterator object to iterate through our data, and a TokenizerFactory object to tokenize our data:

ParagraphVectors pVect; 
LabelAwareIterator iter; 
TokenizerFactory tFact; 
 

Then we will set up our ClassPathResource. This specifies the directory within our project that contains the data files to be classified. The first resource contains our labeled data used for training purposes. We then direct our iterator and tokenizer to use the resources specified as the ClassPathResource. We also specify that we will use the CommonPreprocessor to preprocess our data:

ClassPathResource resource = new  
         ClassPathResource("paravec/labeled"); 
 
iter = new FileLabelAwareIterator.Builder() 
        .addSourceFolder(resource.getFile()) 
        .build(); 
 
tFact = new DefaultTokenizerFactory(); 
tFact.setTokenPreProcessor(new CommonPreprocessor()); 

Next, we build our ParagraphVectors. This is where we specify the learning rate, batch size, and number of training epochs. We include our iterator and tokenizer in the setup process as well. Once we've built our ParagraphVectors, we call the fit method to train our model using the training data in the paravec/labeled directory:

pVect = new ParagraphVectors.Builder() 
        .learningRate(0.025) 
        .minLearningRate(0.001) 
        .batchSize(1000) 
        .epochs(20) 
        .iterate(iter) 
        .trainWordVectors(true) 
        .tokenizerFactory(tFact) 
        .build(); 
 
pVect.fit(); 

Now that we have trained our model, we can use our unlabeled data to test. We create a new ClassPathResource for our unlabeled data and create a new FileLabelAwareIterator as well:

ClassPathResource unlabeledText =  
         new ClassPathResource("paravec/unlabeled"); 
FileLabelAwareIterator unlabeledIter =  
         new FileLabelAwareIterator.Builder() 
               .addSourceFolder(unlabeledText.getFile()) 
               .build(); 

The next step involves iterating through our unlabeled data and identifying the correct label for each document. We can generally expect that each document will fall into multiple labels but have a different weight, or percent match, for each. So, while one article may be mostly classified as a health article, it likely has enough information to be also classified, to a lesser degree, as a science article.

Next, we set up a MeansBuilder and LabelSeeker object. These classes access tables containing the relationships between words and labels, which we will use in our ParagraphVectors. The InMemoryLookupTable class provides access to a default table for word lookup:

MeansBuilder mBuilder =  
   new MeansBuilder((InMemoryLookupTable<VocabWord>)  
      pVect.getLookupTable(),tFact); 
LabelSeeker lSeeker =  
    new LabelSeeker(iter.getLabelsSource().getLabels(), 
               (InMemoryLookupTable<VocabWord>)
    pVect.getLookupTable()); 

Finally, we iterate through our unlabeled documents. For each document, we will change the document into a vector and use our LabelSeeker to get the scores for each document. We log the scores for each document and print out the score with the appropriate labels:

while (unlabeledIter.hasNextDocument()) { 
    LabelledDocument doc = unlabeledIter.nextDocument(); 
    INDArray docCentroid = mBuilder.documentAsVector(doc); 
    List<Pair<String, Double>> scores =  
              lSeeker.getScores(docCentroid); 
    out.println("Document '" + doc.getLabel() +  
       "' falls into the following categories: "); 
    for (Pair<String, Double> score : scores) { 
       out.println ("        " + score.getFirst() + ": " +  
             score.getSecond()); 
        } 
 
} 

The output from our preceding print statements is as follows:

Document 'finance' falls into the following categories: 
finance: 0.2889593541622162
health: 0.11753179132938385
science: 0.021202782168984413
Document 'health' falls into the following categories: 
finance: 0.059537000954151154
health: 0.27373185753822327
science: 0.07699354737997055

In each instance, our documents were classified properly, as demonstrated by the higher percentage assigned to the correct label category. This classification can be used in conjunction with other data analysis techniques to draw additional conclusions about the data contained in the files. Often text classification is an initial or early step in a data analysis process as documents are classified into groups for further analysis.

Classifying text by similarity

In this next example, we will match different text samples based on their structure and similarity. We will still be using the ParagraphVectors class we used in the previous example. To begin, download the raw_sentences.txt file from GitHub (https://github.com/deeplearning4j/dl4j-examples/tree/master/dl4j-examples/src/main/resources) and add it to your project. This file contains a list of sentences which we will read in, label, and then compare.

First, we set up our ClassPathResource and assign an iterator to handle our file data. We have used a SentenceIterator for this example:

ClassPathResource srcFile = new  
      ClassPathResource("/raw_sentences.txt"); 
File file = srcFile.getFile(); 
SentenceIterator iter = new BasicLineIterator(file); 
 

Next, we will again use TokenizerFactory to tokenize our data. We also want to create a new LabelsSource object. This allows us to define the format of our sentence labels. We have chosen to prefix each line with LINE_:

TokenizerFactory tFact = new DefaultTokenizerFactory(); 
tFact.setTokenPreProcessor(new CommonPreprocessor()); 
LabelsSource labelFormat = new LabelsSource("LINE_"); 

Now we are ready to build our ParagraphVectors. Our setup process includes these methods: minWordFrequency, which specifies the minimum word frequency to use in the training corpus, and iterations, which specifies the number of iterations for each mini-batch. We also set the number of epochs, the layer size, and the learning rate. Additionally, we include our LabelsSource, defined earlier, and our iterator and tokenizer. The trainWordVectors method specifies whether word and document representations should be built together. Finally, sampling determines whether subsampling should occur or not. We then call our build and fit methods:

ParagraphVectors vec = new ParagraphVectors.Builder() 
        .minWordFrequency(1) 
        .iterations(5) 
        .epochs(1) 
        .layerSize(100) 
        .learningRate(0.025) 
        .labelsSource(labelFormat) 
        .windowSize(5) 
        .iterate(iter) 
        .trainWordVectors(false) 
        .tokenizerFactory(tFact) 
        .sampling(0) 
        .build(); 
 
vec.fit(); 
 

Next, we will include some statements to evaluate the accuracy of our classifications. It is important to note that, while line numbers in the document start at 1, label indexing begins at 0. So, for example, line 9836 in the document will be associated with the label LINE_9835. We will first compare three sentences that should be classified as somewhat similar, and then two examples comparing dissimilar sentences. The similarity method takes two labels and returns the relative distance between them as a double.
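
Before looking at the comparisons, a tiny helper makes the off-by-one mapping explicit; the method name is ours and is not part of DL4J:

// Maps a 1-based line number in raw_sentences.txt to its 0-based label 
static String labelForLine(int oneBasedLine) { 
    return "LINE_" + (oneBasedLine - 1);   // line 9836 -> LINE_9835 
} 

With that mapping in mind, the comparisons are as follows: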

double similar1 = vec.similarity("LINE_9835", "LINE_12492"); 
out.println("Comparing lines 9836 & 12493 " 
        + "('This is my house .'/'This is my world .') " 
        + "Similarity = " + similar1); 
 
double similar2 = vec.similarity("LINE_3720", "LINE_16392"); 
out.println("Comparing lines 3721 & 16393 " 
        + "('This is my way .'/'This is my work .') " 
        + "Similarity = " + similar2); 
 
double similar3 = vec.similarity("LINE_6347", "LINE_3720"); 
out.println("Comparing lines 6348 & 3721 " 
        + "('This is my case .'/'This is my way .') " 
        + "Similarity = " + similar3); 
 
double dissimilar1 = vec.similarity("LINE_3720", "LINE_9852"); 
out.println("Comparing lines 3721 & 9853 " 
        + "('This is my way .'/'We now have one .') " 
        + "Similarity = " + dissimilar1); 
 
double dissimilar2 = vec.similarity("LINE_3720", "LINE_3719"); 
out.println("Comparing lines 3721 & 3720 " 
        + "('This is my way .'/'At first he says no .') " 
        + "Similarity = " + dissimilar2); 

The output of our print statements is shown as follows. Compare the result of the similarity method for the three similar sentences and the two dissimilar sentences. Of particular note, the similarity method result for the last example, two very dissimilar sentences, returned a negative number. This implies a more significant disparity:

16:56:15.423 [main] INFO o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [3171540]; Lines vectorized so far: [485810]; learningRate: [1.0E-4]
Comparing lines 9836 & 12493 ('This is my house .'/'This is my world .') Similarity = 0.7641470432281494
Comparing lines 3721 & 16393 ('This is my way .'/'This is my work .') Similarity = 0.7246013879776001
Comparing lines 6348 & 3721 ('This is my case .'/'This is my way .') Similarity = 0.8988922834396362
Comparing lines 3721 & 9853 ('This is my way .'/'We now have one .') Similarity = 0.5840312242507935
Comparing lines 3721 & 3720 ('This is my way .'/'At first he says no .') Similarity = -0.6491150259971619

Although this example uses ParagraphVectors like our first classification example, this demonstrates flexibility in our approach. We can use these DL4J libraries to classify data in more than one manner.
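
As a final, hedged usage note, a trained ParagraphVectors model can also be queried directly for the labeled lines closest to an arbitrary piece of text through its nearestLabels method; the query sentence here is illustrative:

// Ask the trained model which labeled lines are most similar to a new sentence 
Collection<String> closest = vec.nearestLabels("This is my house .", 3); 
out.println("Labels nearest to the query: " + closest); 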

Understanding tagging and POS

POS is concerned with identifying the types of components found in a sentence. For example, this sentence has several elements, including the verb "has", several nouns such as "example" and "elements", and adjectives such as "several". Tagging, or more specifically POS tagging, is the process of associating element types to words.

POS tagging is useful as it adds more information about the sentence. We can ascertain the relationship between words and often their relative importance. The results of tagging are often used in later processing steps.

This task can be difficult as we are unable to rely upon a simple dictionary of words to determine their type. For example, the word lead can be used as both a noun and as a verb. We might use it in either of the following two sentences:

He took the lead in the play.
Lead the way!

POS tagging will attempt to associate the proper label to each word of a sentence.

Using OpenNLP to identify POS

To illustrate this process, we will be using OpenNLP (https://opennlp.apache.org/). This is an open source Apache project which supports many other NLP processing tasks.

We will be using the POSModel class, which can be trained to recognize POS elements. In this example, we will use it with a previously trained model based on the Penn TreeBank tag-set (http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html). Various pretrained models are found at http://opennlp.sourceforge.net/models-1.5/. We will be using the en-pos-maxent.bin model. This has been trained on English text using what is called maximum entropy.

Maximum entropy refers to the amount of uncertainty in the model, which the training process maximizes. For a given problem, there is a set of probabilities describing what is known about the data set. These probabilities are used to build a model. For example, we may know that there is a 23 percent chance that one specific event may follow a certain condition. We do not want to make any assumptions about unknown probabilities, so we avoid adding unjustified information. A maximum entropy approach attempts to preserve as much uncertainty as possible; hence, it attempts to maximize entropy.
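
Stated a little more formally (this is the standard textbook formulation rather than anything specific to OpenNLP), the model chooses the distribution p that maximizes the entropy

H(p) = -\sum_{x} p(x) \log p(x)

subject to the constraint that the expected value of each feature under the model matches the value observed in the training data. Among all distributions consistent with what is known, this is the one that assumes the least.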

We will also use the POSTaggerME class, which is a maximum entropy tagger. This is the class that will make tag predictions. With any sentence, there may be more than one way of classifying, or tagging, its components.

We start with code to acquire the previously trained English tagger model and a simple sentence to be tagged:

try (InputStream input = new FileInputStream( 
        new File("en-pos-maxent.bin"));) { 
    String sentence = "Let's parse this sentence."; 
    ... 
} catch (IOException ex) { 
    // Handle exceptions 
} 

The tagger uses an array of strings, where each string is a word. The following sequence takes the previous sentence and creates an array called words. The first part uses the Scanner class to parse the sentence string. We could have used other code to read the data from a file if needed. After that, the List class's toArray method is used to create the array of strings:

List<String> list = new ArrayList<>(); 
Scanner scanner = new Scanner(sentence); 
while(scanner.hasNext()) { 
    list.add(scanner.next()); 
} 
String[] words = new String[1]; 
words = list.toArray(words); 

The model is then built using the file containing the model:

POSModel posModel = new POSModel(input); 

The tagger is then created based on the model:

POSTaggerME posTagger = new POSTaggerME(posModel); 

The tag method does the actual work. It is passed an array of words and returns an array of tags. The words and tags are then displayed:

String[] posTags = posTagger.tag(words); 
for(int i=0; i<posTags.length; i++) { 
    out.println(words[i] + " - " + posTags[i]); 
} 

The output for this example follows:

Let's - NNP
parse - NN
this - DT
sentence. - NN

The analysis has determined that the word let's is a singular proper noun while the words parse and sentence are singular nouns. The word this is a determiner, that is, it is a word that modifies another and helps identify a phrase as general or specific. A list of tags is provided in the next section.

Understanding POS tags

The POS tags returned are abbreviations. A list of Penn TreeBank POS tags can be found at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The following is a shortened version of this list:

Tag    Description                      Tag    Description
DT     Determiner                       RB     Adverb
JJ     Adjective                        RBR    Adverb, comparative
JJR    Adjective, comparative           RBS    Adverb, superlative
JJS    Adjective, superlative           RP     Particle
NN     Noun, singular or mass           SYM    Symbol
NNS    Noun, plural                     TOP    Top of the parse tree
NNP    Proper noun, singular            VB     Verb, base form
NNPS   Proper noun, plural              VBD    Verb, past tense
POS    Possessive ending                VBG    Verb, gerund or present participle
PRP    Personal pronoun                 VBN    Verb, past participle
PRP$   Possessive pronoun               VBP    Verb, non-3rd person singular present
S      Simple declarative clause        VBZ    Verb, 3rd person singular present

As mentioned earlier, there may be more than one possible set of POS assignments for a sentence. The topKSequences method, as shown next, will return various assignment possibilities along with a score. The method returns an array of Sequence objects whose toString method returns the score and POS list:

    Sequence sequences[] = posTagger.topKSequences(words); 
    for(Sequence sequence : sequences) { 
        out.println(sequence); 
    } 

The output for the previous sentence follows. The first sequence has the highest score and is therefore considered the most probable assignment:

-2.3264880694837213 [NNP, NN, DT, NN]
-2.6610271245387853 [NNP, VBD, DT, NN]
-2.6630142638557217 [NNP, VB, DT, NN]

Each line of output assigns possible tags to each word of the sentence. We can see that only the second word, parse, is determined to have other possible tags.
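
OpenNLP can also report how confident the tagger was in each individual tag. The probs method of POSTaggerME returns the probabilities for the most recently tagged sentence; this short, hedged sketch assumes it runs immediately after the earlier call to tag:

double[] tagProbs = posTagger.probs(); 
for (int i = 0; i < posTags.length; i++) { 
    // Pair each word and its tag with the tagger's confidence 
    out.println(words[i] + "/" + posTags[i] + " - " + tagProbs[i]); 
} 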

Next, we will demonstrate how to extract relationships from text.

Extracting relationships from sentences

Knowing the relationship between the elements of a sentence is important in many analysis tasks. It is useful for assessing the important content of a sentence and provides insight into its meaning. This type of analysis has been used for tasks ranging from grammar checking to speech recognition to language translation.

In the previous section, we demonstrated one approach to extracting the parts of speech. Using this technique, we were able to identify the types of sentence elements present in a sentence. However, the relationships between these elements are missing. We need to parse the sentence to extract the relationships between its elements.

Using OpenNLP to extract relationships

There are several techniques and APIs that can be used to extract this type of information. In this section, we will use OpenNLP to demonstrate one way of extracting the structure of a sentence. The demonstration is centered around the ParserTool class, which uses a previously trained model. The parsing process returns scores reflecting how likely each extracted parse is to be correct. As with many NLP tasks, there are often multiple possible answers.

We start with a try-with-resources block to open an input stream for the model. The en-parser-chunking.bin file contains a model that parses text into its POS elements. In this case, it is trained for English:

try (InputStream modelInputStream = new FileInputStream( 
            new File("en-parser-chunking.bin"));) { 
    ... 
} catch (Exception ex) { 
    // Handle exceptions 
}  

Within the try block an instance of the ParserModel class is created using the input stream. The actual parser is created next using the ParserFactory class's create method:

ParserModel parserModel = new ParserModel(modelInputStream); 
Parser parser = ParserFactory.create(parserModel); 

We will use the following sentence to test the parser. The ParserTool class's parseLine method does the actual parsing and returns an array of Parse objects. Each of these objects holds one parsing alternative. The last argument of the parseLine method specifies how many alternatives to return:

String sentence = "Let's parse this sentence."; 
Parse[] parseTrees = ParserTool.parseLine(sentence, parser, 3); 

The next sequence displays each of the possibilities:

for(Parse tree : parseTrees) { 
    tree.show(); 
} 

The output of the show method for this example follows. The tags were defined previously in the Understanding POS tags section:

(TOP (NP (NP (NNP Let's) (NN parse)) (NP (DT this) (NN sentence.))))
(TOP (S (NP (NNP Let's)) (VP (VB parse) (NP (DT this) (NN sentence.)))))
(TOP (S (NP (NNP Let's)) (VP (VBD parse) (NP (DT this) (NN sentence.)))))

The following example reformats the last two outputs to better show the relationships. They differ in how they classify the verb parse:

(TOP
  (S
    (NP (NNP Let's))
    (VP (VB parse)
      (NP (DT this) (NN sentence.))
    )
  )
)
(TOP
  (S
    (NP (NNP Let's))
    (VP (VBD parse)
      (NP (DT this) (NN sentence.))
    )
  )
)

When there are multiple parse alternatives, the Parse class's getProb method returns a score that reflects the model's confidence in each alternative. The following sequence demonstrates this method:

for(Parse tree : parseTrees) { 
    out.println("Probability: " + tree.getProb()); 
} 

The output follows:

Probability: -3.6810244423259078
Probability: -3.742475884515823
Probability: -4.16148634555491
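
These values are log probabilities, so the value closest to zero corresponds to the most likely parse. The following short sketch selects it, using only the getProb and show methods already introduced:

// Pick the parse alternative with the highest (least negative) log probability
Parse best = parseTrees[0];
for (Parse tree : parseTrees) {
    if (tree.getProb() > best.getProb()) {
        best = tree;
    }
}
best.show();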

Another interesting NLP task is sentiment analysis, which we will demonstrate next.

Sentiment analysis

Sentiment analysis involves the evaluation and classification of words based on their context, meaning, and emotional implications. Typically, if we were to look up a word in a dictionary we will find a meaning or definition for the word but, taken out of the context of a sentence, we may not be able to ascribe detailed and precise meaning to the word.

For example, the word toast could be defined as simply a slice of heated and browned bread. But in the context of the sentence He's toast!, the meaning changes completely. Sentiment analysis seeks to derive meanings of words based on their context and usage.

It is important to note that advanced sentiment analysis will expand beyond simple positive or negative classification and ascribe detailed emotional meaning to words. It is far simpler to classify words as positive or negative but far more useful to classify them as happy, furious, indifferent, or anxious.

This type of analysis falls into the category of affective computing, a type of computing concerned with the emotional implications and uses of technological tools. This type of computing is especially significant given the growing amount of emotionally charged data readily available for analysis on social media sites today.

Being able to determine the emotional content of text enables a more targeted, and appropriate response. For example, being able to judge the emotional response in a chat session between a customer and technical representative can allow the representative to do a better job. This can be especially important when there is a cultural or language gap between them.

This type of analysis can also be applied to visual images. It could be used to gauge someone's response to a new product, such as when conducting a taste test, or to judge how people react to scenes of a movie or commercial.

As part of our example, we will be using a bag-of-words model. A bag-of-words model simplifies word representation for natural language processing by treating a text as a set, known as the bag, of its words, irrespective of grammar or word order. The words carry features used for classification, most importantly the frequency of each word. Because some words, such as the, a, or and, naturally have a higher frequency in any text, the words are also given weights. Common words with less contextual significance receive a smaller weight and factor less into the text analysis.
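
The following is a minimal, self-contained sketch of the bag-of-words idea. It is not part of the DL4J pipeline used below; the review text, stop-word list, and weighting factor are purely illustrative:

import java.util.*;

public class BagOfWordsExample {
    public static void main(String[] args) {
        String review = "The movie was great and the acting was great";
        // Tiny illustrative stop-word list; real lists are much longer
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "and", "was"));

        // The "bag": word -> frequency, ignoring grammar and word order
        Map<String, Integer> bag = new HashMap<>();
        for (String token : review.toLowerCase().split("\\s+")) {
            bag.merge(token, 1, Integer::sum);
        }

        // Down-weight common words so they factor less into the analysis
        Map<String, Double> weighted = new HashMap<>();
        bag.forEach((word, count) ->
            weighted.put(word, stopWords.contains(word) ? count * 0.1 : (double) count));

        System.out.println(weighted);
    }
}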

Downloading and extracting the Word2Vec model

To demonstrate sentiment analysis, we will use Google's Word2Vec models in conjunction with DL4J to simply classify movie reviews as either positive or negative based upon the words used in the review. This example is adapted from work done by Alex Black (https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/recurrent/word2vecsentiment/Word2VecSentimentRNN.java). As discussed previously in this chapter, Word2Vec consists of two-layer neural networks trained to build meaning from the context of words. We will also be using a large set of movie reviews from http://ai.stanford.edu/~amaas/data/sentiment/.

Before we begin, you will need to download the Word2Vec data from https://code.google.com/p/word2vec/. The basic process includes:

  • Downloading and extracting the movie reviews
  • Loading the Word2Vec Google News vectors
  • Loading each movie review

The words within the reviews are then broken into vectors and used to train the network. We will train the network across five epochs and evaluate the network's performance after each epoch.

To begin, we first declare three final variables. The first is the URL to retrieve the training data, the second is the location to store our extracted data, and the third is the location of the Google News vectors on the local machine. Modify this third variable to reflect the location on your local machine:

public static final String TRAINING_DATA_URL =  
    "http://ai.stanford.edu/~amaas/" +  
    "data/sentiment/aclImdb_v1.tar.gz"; 
public static final String EXTRACT_DATA_PATH =  
    FilenameUtils.concat(System.getProperty( 
    "java.io.tmpdir"), "dl4j_w2vSentiment/"); 
public static final String GNEWS_VECTORS_PATH =  
    "C:/YOUR_PATH/GoogleNews-vectors-negative300.bin" +  
    "/GoogleNews-vectors-negative300.bin"; 
 

Next we download and extract our model data. The next two methods are modelled after the code found in the DL4J example. We first create a new method, getModelData. The method is shown next in its entirety.

First we create a new File using the EXTRACT_DATA_PATH we defined previously. If the file does not already exist, we create a new directory. Next, we create two more File objects, one for the path to the archived TAR file and one for the path to the extracted data. Before we attempt to extract the data, we check whether these two files exist. If the archive path does not exist, we download the data from the TRAINING_DATA_URL and then extract the data. If the extracted file does not exist, we then extract the data:

  
private static void getModelData() throws Exception { 
    File modelDir = new File(EXTRACT_DATA_PATH); 
    if (!modelDir.exists()) { 
        modelDir.mkdir(); 
    } 
    String archivePath = EXTRACT_DATA_PATH + "aclImdb_v1.tar.gz"; 
    File archiveName = new File(archivePath); 
    String extractPath = EXTRACT_DATA_PATH + "aclImdb"; 
    File extractName = new File(extractPath); 
    if (!archiveName.exists()) { 
        FileUtils.copyURLToFile(new URL(TRAINING_DATA_URL), 
                archiveName); 
        extractTar(archivePath, EXTRACT_DATA_PATH); 
    } else if (!extractName.exists()) { 
        extractTar(archivePath, EXTRACT_DATA_PATH); 
    } 
} 

To extract our data, we will create another method called extractTar. We will provide two inputs to the method, the archivePath and the EXTRACT_DATA_PATH defined before. We also need to define our buffer size to use in the extraction process:

private static final int BUFFER_SIZE = 4096; 

We first create a new TarArchiveInputStream. We use the GzipCompressorInputStream because it provides support for extracting .gz files. We also use the BufferedInputStream to improve performance in our extraction process. The compressed file is very large and may take some time to download and extract.

Next we create a TarArchiveEntry and begin reading data using the TarArchiveInputStream class's getNextEntry method. As we process each entry in the compressed file, we first check whether the entry is a directory. If it is, we create the corresponding directory in our extraction location. Otherwise, we create a FileOutputStream and BufferedOutputStream inside a try-with-resources block, so they are closed automatically, and use the write method to write the data to the extracted location:

private static void extractTar(String dataIn, String dataOut)
        throws IOException {
    try (TarArchiveInputStream inStream =
            new TarArchiveInputStream(
                new GzipCompressorInputStream(
                    new BufferedInputStream(
                        new FileInputStream(dataIn))))) {
        TarArchiveEntry tarEntry;
        while ((tarEntry = (TarArchiveEntry) inStream.getNextEntry())
                != null) {
            if (tarEntry.isDirectory()) {
                // Recreate the directory structure in the extraction location
                new File(dataOut + tarEntry.getName()).mkdirs();
            } else {
                int count;
                byte[] data = new byte[BUFFER_SIZE];
                // try-with-resources closes the output streams for each entry
                try (FileOutputStream fileOutStream =
                         new FileOutputStream(dataOut + tarEntry.getName());
                     BufferedOutputStream outStream =
                         new BufferedOutputStream(fileOutStream, BUFFER_SIZE)) {
                    while ((count = inStream.read(data, 0, BUFFER_SIZE))
                            != -1) {
                        outStream.write(data, 0, count);
                    }
                }
            }
        }
    }
}
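
For reference, the two helper methods can be wired together at start-up. The main method below is purely illustrative and not part of the original example:

public static void main(String[] args) throws Exception {
    getModelData();  // downloads aclImdb_v1.tar.gz if necessary and extracts it
    // ... build, train, and evaluate the network as shown in the next section
}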

Building our model and classifying text

Now that we have created methods to download and extract our data, we need to declare and initialize the variables that control the execution of our model. batchSize is the number of reviews processed in each training minibatch, in this case 50. vectorSize determines the size of the word vectors; the Google News model uses vectors of size 300. nEpochs is the number of passes we make through the training data. Finally, truncateReviewsToLength sets the maximum review length, in words, to limit memory use; we have chosen to truncate reviews longer than 300 words:

int batchSize = 50;      
int vectorSize = 300; 
int nEpochs = 5;         
int truncateReviewsToLength = 300;   

Now we can set up our neural network. We will use a MultiLayerConfiguration network, as discussed in Chapter 8, Deep Learning. In fact, our example here is very similar to the model built in the Configuring and building a model section, with a few differences. In particular, in this model we use a higher learning rate and a GravesLSTM recurrent layer as layer 0. The number of input neurons matches the dimensionality of our word vectors, in this case 300. We also use gradientNormalization, a technique that helps the algorithm find an optimal solution. Notice that the output layer uses the softmax activation function, which was discussed in Chapter 8, Deep Learning. This function, a generalization of logistic regression, is especially well suited to classification:

MultiLayerConfiguration sentimentNN =  
         new NeuralNetConfiguration.Builder() 
        .optimizationAlgo(OptimizationAlgorithm 
                 .STOCHASTIC_GRADIENT_DESCENT).iterations(1) 
        .updater(Updater.RMSPROP) 
        .regularization(true).l2(1e-5) 
        .weightInit(WeightInit.XAVIER) 
        .gradientNormalization(GradientNormalization 
                 .ClipElementWiseAbsoluteValue) 
                 .gradientNormalizationThreshold(1.0) 
        .learningRate(0.0018) 
        .list() 
        .layer(0, new GravesLSTM.Builder() 
                 .nIn(vectorSize).nOut(200) 
                .activation("softsign").build()) 
        .layer(1, new RnnOutputLayer.Builder() 
                .activation("softmax") 
                .lossFunction(LossFunctions.LossFunction.MCXENT) 
                .nIn(200).nOut(2).build()) 
        .pretrain(false).backprop(true).build(); 
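
For reference, softmax converts the raw outputs of the two output neurons into probabilities that sum to one; the larger probability determines the predicted class. The following is a tiny standalone sketch with hypothetical scores, separate from the DL4J configuration above:

double[] scores = {1.2, -0.4};  // hypothetical raw outputs for the two classes
double sum = 0;
for (double s : scores) {
    sum += Math.exp(s);
}
double[] probs = new double[scores.length];
for (int i = 0; i < scores.length; i++) {
    probs[i] = Math.exp(scores[i]) / sum;  // probabilities sum to 1.0
}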
 

We can then create our MultiLayerNetwork, initialize the network, and set listeners:

MultiLayerNetwork net = new MultiLayerNetwork(sentimentNN); 
net.init(); 
net.setListeners(new ScoreIterationListener(1)); 

Next we create a WordVectors object to load our Google data. We use a DataSetIterator to test and train our data. The AsyncDataSetIterator allows us to load our data in a separate thread, to improve performance. This process requires a large amount of memory and so improvements such as this are essential for optimal performance:

// Load the pretrained Google News vectors (binary format); loadGoogleModel is
// assumed here, matching the API of DL4J versions contemporary with this example
WordVectors wordVectors = WordVectorSerializer.loadGoogleModel(
    new File(GNEWS_VECTORS_PATH), true, false);
DataSetIterator trainData = new AsyncDataSetIterator(
    new SentimentExampleIterator(EXTRACT_DATA_PATH, wordVectors,
        batchSize, truncateReviewsToLength, true), 1);
DataSetIterator testData = new AsyncDataSetIterator(
    new SentimentExampleIterator(EXTRACT_DATA_PATH, wordVectors,
        100, truncateReviewsToLength, false), 1);

Finally, we are ready to train and evaluate our model. We run through our data nEpochs times; in this case, five epochs. Each epoch executes the fit method against our training data and then creates a new Evaluation object to evaluate the model using testData. The evaluation is based on around 25,000 movie reviews and can take a significant amount of time to run. As we evaluate the data, we create INDArray objects to store information, including the feature matrix and labels from our data. This data is then used by the evalTimeSeries method for evaluation. Finally, we print out the evaluation statistics:

for (int i = 0; i < nEpochs; i++) { 
    net.fit(trainData); 
    trainData.reset(); 
 
    Evaluation evaluation = new Evaluation(); 
    while (testData.hasNext()) { 
        DataSet t = testData.next(); 
        INDArray dataFeatures = t.getFeatureMatrix(); 
        INDArray dataLabels = t.getLabels(); 
        INDArray inMask = t.getFeaturesMaskArray(); 
        INDArray outMask = t.getLabelsMaskArray(); 
        INDArray predicted = net.output(dataFeatures, false,  
            inMask, outMask); 
 
        evaluation.evalTimeSeries(dataLabels, predicted, outMask); 
    } 
    testData.reset(); 
 
    out.println(evaluation.stats()); 
} 

The output from the final iteration is shown next. Our examples classified as 0 are considered negative reviews and the ones classified as 1 are considered positive reviews:

Epoch 4 complete. Starting evaluation:
Examples labeled as 0 classified by model as 0: 11122 times
Examples labeled as 0 classified by model as 1: 1378 times
Examples labeled as 1 classified by model as 0: 3193 times
Examples labeled as 1 classified by model as 1: 9307 times
==========================Scores===================================
Accuracy: 0.8172
Precision: 0.824
Recall: 0.8172
F1 Score: 0.8206
===================================================================
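
As a quick sanity check, these scores follow directly from the confusion matrix above, assuming precision and recall are macro-averaged over the two classes:

int tn = 11122, fp = 1378, fn = 3193, tp = 9307;  // class 1 = positive review
double total = tn + fp + fn + tp;                 // 25,000 test reviews
double accuracy = (tp + tn) / total;                                        // 0.8172
double precision = (tn / (double) (tn + fn) + tp / (double) (tp + fp)) / 2; // ~0.824
double recall = (tn / (double) (tn + fp) + tp / (double) (tp + fn)) / 2;    // ~0.8172
double f1 = 2 * precision * recall / (precision + recall);                  // ~0.8206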

Compared with previous epochs, you should notice the score and accuracy improving with each evaluation. With each pass through the training data, our model becomes better at classifying movie reviews as either negative or positive.

Summary

In this chapter, we introduced a number of NLP tasks and showed how they are supported. In particular, we used OpenNLP and DL4J to illustrate how they are performed. While there are a number of other libraries available, these examples provide a good introduction to the techniques.

We started with an introduction to basic NLP terms and concepts such as named entity recognition, POS, and relationships between elements of a sentence. Named entity recognition is concerned with finding and labeling the parts of a sentence such as people, locations, and things. POS associates labels with elements of a sentence. For example, NN refers to a noun and VB to a verb.

We then included a discussion of the Word2Vec and Doc2Vec neural networks. These were used to classify text, both with labels and by similarity with other words. We demonstrated the use of DL4J resources to create feature vectors for document association with labels.

While the identification of these associations is interesting, a more useful analysis is performed when relationships are extracted from a sentence. We demonstrated how relationships are found using OpenNLP. POS tags are associated with each word, and the relationships between the words are shown using a set of tags and parentheses. This type of analysis can be used for more sophisticated tasks such as language translation and grammar checking.

Finally, we discussed and showed examples of sentiment analysis. This process involves classifying text based on its tone or contextual meaning. We examined a process for classifying movie reviews as positive or negative.

In this chapter, we demonstrated various techniques for text analysis and classification. In our next chapter, we will examine techniques designed for video and audio analysis.