Chapter 12. Bringing It All Together

While we have demonstrated many aspects of using Java to support data science tasks, we still need to combine these techniques and use them in an integrated manner. It is one thing to use the techniques in isolation and another to use them in a cohesive fashion. In this chapter, we will give you additional experience with these technologies and insights into how they can be used together.

Specifically, we will create a console-based application that analyzes tweets related to a user-defined topic. Using a console-based application allows us to focus on data-science-specific technologies and avoids having to choose a specific GUI technology that may not be relevant to us. It provides a common base from which a GUI implementation can be created if needed.

The application performs and illustrates the following high-level tasks:

  • Data acquisition
  • Data cleaning, including:
    • Removing stop words
    • Cleaning the text
  • Sentiment analysis
  • Basic data statistics collection
  • Display of results

More than one type of analysis can be used with many of these steps. We will show the most relevant approaches and allude to other possibilities as appropriate. We will use Java 8's features whenever possible.

Defining the purpose and scope of our application

The application will prompt the user for a set of selection criteria, which include topic and sub-topic areas, and the number of tweets to process. The analysis performed will simply compute and display the number of positive and negative tweets for a topic and sub-topic. We used a generic sentiment analysis model, which will affect the quality of the sentiment analysis. However, other models and more analysis can be added.

We will use a Java 8 stream to structure the processing of tweet data. It is a stream of TweetHandler objects, as we will describe shortly.

We use several classes in this application. They are summarized here:

  • TweetHandler: This class holds the raw tweet text and the specific fields needed for processing, including the actual tweet, username, and similar attributes.
  • TwitterStream: This is used to acquire the application's data. Using a specific class separates the acquisition of the data from its processing. The class possesses a few fields that control how the data is acquired.
  • ApplicationDriver: This contains the main method, user prompts, and the TweetHandler stream that controls the analysis.

Each of these classes will be detailed in later sections. However, we will present ApplicationDriver next to provide an overview of the analysis process and how the user interacts with the application.

Understanding the application's architecture

Every application has its own unique structure, or architecture. This architecture provides the overarching organization or framework for the application. For this application, we combine the three classes using a Java 8 stream in the ApplicationDriver class. This class consists of three methods:

  • ApplicationDriver: Handles the application's user input
  • performAnalysis: Performs the analysis
  • main: Creates the ApplicationDriver instance

The class structure is shown next. The three instance variables are used to control the processing:

public class ApplicationDriver { 
    private String topic; 
    private String subTopic; 
    private int numberOfTweets; 
 
    public ApplicationDriver() { ... } 
    public void performAnalysis() { ... } 
 
    public static void main(String[] args) { 
        new ApplicationDriver(); 
    } 
} 

The ApplicationDriver constructor follows. A Scanner instance is created and the sentiment analysis model is built:

public ApplicationDriver() { 
    Scanner scanner = new Scanner(System.in); 
    TweetHandler swt = new TweetHandler(); 
    swt.buildSentimentAnalysisModel(); 
    ... 
} 

The remainder of the method prompts the user for input and then calls the performAnalysis method:

out.println("Welcome to the Tweet Analysis Application"); 
out.print("Enter a topic: "); 
this.topic = scanner.nextLine(); 
out.print("Enter a sub-topic: "); 
this.subTopic = scanner.nextLine().toLowerCase(); 
out.print("Enter number of tweets: "); 
this.numberOfTweets = scanner.nextInt(); 
performAnalysis(); 
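
Note that nextInt leaves the trailing newline character in the scanner's buffer. This is harmless here because no further line-based input is read, but if the prompts are ever extended, a defensive nextLine call avoids a classic Scanner pitfall:

this.numberOfTweets = scanner.nextInt(); 
scanner.nextLine(); // Consume the trailing newline before any later nextLine call 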

The performAnalysis method uses a Java 8 Stream instance obtained from the TwitterStream instance. The TwitterStream class constructor uses the number of tweets and topic as input. This class is discussed in the Data acquisition using Twitter section:

public void performAnalysis() { 
    Stream<TweetHandler> stream = new TwitterStream( 
        this.numberOfTweets, this.topic).stream(); 
    ... 
} 

The stream uses a series of map and filter methods and a terminal forEach method to perform the processing. The map methods modify the stream's elements, the filter methods remove elements from the stream, and the forEach method terminates the stream and generates the output.

The individual methods of the stream are executed in order. When acquired from a public Twitter stream, the Twitter information arrives as a JSON document, which we process first. This allows us to extract relevant tweet information and assign the data to the fields of the TweetHandler instance. Next, the text of the tweet is converted to lowercase and non-English tweets are filtered out. Stop words are then removed, tweets that do not contain the sub-topic are discarded, and sentiment analysis is performed. The last step computes the statistics:

stream 
        .map(s -> s.processJSON()) 
        .map(s -> s.toLowerCase()) 
        .filter(s -> s.isEnglish()) 
        .map(s -> s.removeStopWords()) 
        .filter(s -> s.containsCharacter(this.subTopic)) 
        .map(s -> s.performSentimentAnalysis()) 
        .forEach((TweetHandler s) -> { 
            s.computeStats(); 
            out.println(s); 
        }); 

The results of the processing are then displayed:

out.println(); 
out.println("Positive Reviews: " 
        + TweetHandler.getNumberOfPositiveReviews()); 
out.println("Negative Reviews: " 
        + TweetHandler.getNumberOfNegativeReviews()); 

We tested our application during a Monday-night football game and used the topic #MNF. The # symbol is called a hashtag and is used to categorize tweets. By selecting a popular category of tweets, we ensured that we would have plenty of Twitter data to work with. For simplicity, we chose the football sub-topic. We also chose to only analyze 50 tweets for this example. The following is an abbreviated sample of our prompts, input, and output:

Building Sentiment Model
Welcome to the Tweet Analysis Application
Enter a topic: #MNF
Enter a sub-topic: football
Enter number of tweets: 50
Creating Twitter Stream
51 messages processed!
Text: rt @ bleacherreport : touchdown , broncos ! c . j . anderson punches ! lead , 7 - 6 # mnf # denvshou 
Date: Mon Oct 24 20:28:20 CDT 2016
Category: neg
...
Text: i cannot emphasize enough how big td drive . @ broncos offense . needed confidence booster & amp ; just got . # mnf # denvshou 
Date: Mon Oct 24 20:28:52 CDT 2016
Category: pos
Text: least touchdown game . # mnf 
Date: Mon Oct 24 20:28:52 CDT 2016
Category: neg
Positive Reviews: 13
Negative Reviews: 27

We print out the text of each tweet, along with a timestamp and category. Notice that the text of the tweet does not always make sense. This may be due to the abbreviated nature of Twitter data, but it is partially due to the fact that this text has been cleaned and stop words have been removed. We should still see our topic, #MNF, although it will be lowercase due to our text cleaning. At the end, we print out the total number of tweets classified as positive and negative.

The classification of tweets is done by the performSentimentAnalysis method. Notice that the process of classification using sentiment analysis is not always precise. The following tweet mentions a touchdown by a Denver Broncos player. This tweet could be construed as positive or negative depending on an individual's personal feelings about that team, but our model classified it as positive:

Text: cj anderson td run @ broncos . broncos now lead 7 - 6 . # mnf 
Date: Mon Oct 24 20:28:42 CDT 2016
Category: pos

Additionally, some tweets may have a neutral tone, such as the one shown next, but still be classified as either positive or negative. The following tweet is a retweet of a popular sports news twitter handle, @bleacherreport:

Text: rt @ bleacherreport : touchdown , broncos ! c . j . anderson punches ! lead , 7 - 6 # mnf # denvshou 
Date: Mon Oct 24 20:28:37 CDT 2016
Category: neg

This tweet has been classified as negative but perhaps could be considered neutral. The contents of the tweet simply provide information about a score in a football game. Whether this is a positive or negative event will depend upon which team a person may be rooting for. When we examine the entire set of tweet data analysed, we notice that this same @bleacherreport tweet has been retweeted a number of times and classified as negative each time. This could skew our analysis when we consider that we may have a large number of improperly classified tweets. Using incorrect data decreases the accuracy of the results.

One option, depending on the purpose of the analysis, may be to exclude tweets by news outlets or other popular Twitter users. Additionally, we could exclude tweets containing RT, an abbreviation denoting that the tweet is a retweet of another user, as shown in the sketch that follows.
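
Since the tweet text has already been converted to lowercase by this point in the stream, one way to do this is a small predicate on TweetHandler combined with an extra filter step. The following is a minimal sketch; isRetweet is a hypothetical helper and is not part of the original class:

public boolean isRetweet() { 
    // Assumes the text has already been lowercased by toLowerCase() 
    return this.text.startsWith("rt @"); 
} 

The stream would then include .filter(s -> !s.isRetweet()) after the toLowerCase step.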

There are additional issues to consider when performing this type of analysis, including the sub-topic used. If we were to analyze the popularity of a Star Wars character, then we would need to be careful which names we use. For example, when choosing a character name such as Han Solo, the tweet may use an alias. Aliases for Han Solo include Vykk Draygo, Rysto, Jenos Idanian, Solo Jaxal, Master Marksman, and Jobekk Jonn, to mention a few (http://starwars.wikia.com/wiki/Category:Han_Solo_aliases). The actor's name may be used instead of the actual character, which is Harrison Ford in the case of Han Solo. We may also want to consider the actor's nickname, such as Harry for Harrison.
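
One way to handle aliases is to match against a list of equivalent terms rather than a single sub-topic string. The following sketch adds a hypothetical containsAny method to TweetHandler, assuming java.util.Arrays and java.util.List are imported; the alias list is illustrative only:

private static final List<String> HAN_SOLO_TERMS = Arrays.asList( 
    "han solo", "vykk draygo", "rysto", "jenos idanian", 
    "solo jaxal", "master marksman", "jobekk jonn", 
    "harrison ford", "harry"); 
 
// True if the lowercased tweet text mentions any of the terms 
public boolean containsAny(List<String> terms) { 
    return terms.stream().anyMatch(this.text::contains); 
} 

The stream's containsCharacter filter could then be replaced with .filter(s -> s.containsAny(HAN_SOLO_TERMS)).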

Data acquisition using Twitter

The Twitter API is used in conjunction with the HBC (Hosebird Client) HTTP client to acquire tweets, as previously illustrated in the Handling Twitter section of Chapter 2, Data Acquisition. This process involves using the public stream API at the default access level to pull a sample of public tweets currently streaming on Twitter. We will refine the data based on user-selected keywords.

To begin, we declare the TwitterStream class. It consists of two instance variables (numberOfTweets and topic), two constructors, and a stream method. The numberOfTweets variable contains the number of tweets to select and process, and topic allows the user to search for tweets related to a specific topic. We have set our default constructor to pull 100 tweets related to Star Wars:

public class TwitterStream { 
    private int numberOfTweets; 
    private String topic; 
 
    public TwitterStream() { 
        this(100, "Star Wars"); 
    } 
 
    public TwitterStream(int numberOfTweets, String topic) { ... } 
 
} 

The heart of our TwitterStream class is the stream method. We start by performing authentication using the information provided by Twitter when we created our Twitter application. We then create a BlockingQueue object to hold our streaming data. In this example, we will set a default capacity of 1000. We use our topic variable in the trackTerms method to specify the types of tweets we are searching for. Finally, we specify our endpoint and turn off stall warnings:

String myKey = "mySecretKey"; 
String mySecret = "mySecret"; 
String myToken = "myToKen"; 
String myAccess = "myAccess"; 
 
out.println("Creating Twitter Stream"); 
BlockingQueue<String> statusQueue = 
    new LinkedBlockingQueue<>(1000); 
StatusesFilterEndpoint endpoint = new StatusesFilterEndpoint(); 
endpoint.trackTerms(Lists.newArrayList("twitterapi", this.topic)); 
endpoint.stallWarnings(false); 

Now we can create an Authentication object using OAuth1, a variation of the OAuth class. This allows us to build our connection client and complete the HTTP connection:

Authentication twitterAuth = new OAuth1(myKey, mySecret, myToken,
  myAccess); 
 
BasicClient twitterClient = new ClientBuilder() 
        .name("Twitter client") 
        .hosts(Constants.STREAM_HOST) 
        .endpoint(endpoint) 
        .authentication(twitterAuth) 
        .processor(new StringDelimitedProcessor(statusQueue)) 
        .build(); 
 
twitterClient.connect(); 

Next, we create two ArrayLists: list to hold our TweetHandler objects and twitterList to hold the JSON data streamed from Twitter. We will discuss the TweetHandler object in the next section. We use the drainTo method in place of the poll method demonstrated in Chapter 2, Data Acquisition, because it can be more efficient for large amounts of data:

List<TweetHandler> list = new ArrayList<>(); 
List<String> twitterList = new ArrayList<>(); 

Next, we loop through our retrieved messages. We call the take method to remove each string message from the BlockingQueue instance. We then create a new TweetHandler object using the message and place it in our list. After we have handled all of our messages and the for loop completes, we stop the HTTP client, display the number of messages, and return our stream of TweetHandler objects:

statusQueue.drainTo(twitterList); 
for(int i=0; i<numberOfTweets; i++) { 
    String message; 
    try { 
        message = statusQueue.take(); 
        list.add(new TweetHandler(message)); 
    } catch (InterruptedException ex) { 
        ex.printStackTrace(); 
    } 
} 
twitterClient.stop(); 
out.printf("%d messages processed!\n", 
    twitterClient.getStatsTracker().getNumMessages()); 
 
return list.stream(); 
} 

We are now ready to clean and analyze our data.
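
As a quick smoke test, the stream method can be exercised on its own, assuming valid Twitter credentials have been supplied in place of the placeholders above:

// Pull 50 tweets on a topic and count them 
Stream<TweetHandler> tweets = new TwitterStream(50, "#MNF").stream(); 
out.println(tweets.count() + " tweets acquired"); 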

Understanding the TweetHandler class

The TweetHandler class holds information about a specific tweet. It takes the raw JSON tweet and extracts those parts that are relevant to the application's needs. It also possesses the methods to process the tweet's text, such as converting the text to lowercase and removing tweets that are not relevant. The first part of the class is shown next:

public class TweetHandler { 
    private String jsonText; 
    private String text; 
    private Date date; 
    private String language; 
    private String category; 
    private String userName; 
    ... 
    public TweetHandler processJSON() { ... } 
    public TweetHandler toLowerCase(){ ... } 
    public TweetHandler removeStopWords(){ ... }     
    public boolean isEnglish(){ ... }     
    public boolean containsCharacter(String character) { ... }        
    public void computeStats(){ ... } 
    public void buildSentimentAnalysisModel() { ... } 
    public TweetHandler performSentimentAnalysis(){ ... } 
} 

The instance variables show the type of data retrieved from a tweet and processed, as detailed here:

  • jsonText: The raw JSON text
  • text: The text of the processed tweet
  • date: The date of the tweet
  • language: The language of the tweet
  • category: The tweet classification, which is positive or negative
  • userName: The name of the Twitter user

There are several other instance variables used by the class. The following are used to create and use a sentiment analysis model. The classifier static variable refers to the model:

private static String[] labels = {"neg", "pos"}; 
private static int nGramSize = 8; 
private static DynamicLMClassifier<NGramProcessLM>  
    classifier = DynamicLMClassifier.createNGramProcess( 
        labels, nGramSize); 
     

The default constructor is used to provide an instance to build the sentiment model. The single argument constructor creates a TweetHandler object using the raw JSON text:

    public TweetHandler() { 
        this.jsonText = ""; 
    } 
 
    public TweetHandler(String jsonText) { 
        this.jsonText = jsonText; 
    } 

The remainder of the methods are discussed in the following sections.

Extracting data for a sentiment analysis model

In Chapter 9, Text Analysis, we used DL4J to perform sentiment analysis. We will use LingPipe in this example as an alternative to our previous approach. Because we want to classify Twitter data, we chose a dataset with pre-classified tweets, available at http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip. We must complete a one-time process of extracting this data into a format we can use with our model before we continue with our application development.

This dataset exists in a large .csv file with one tweet and classification per line. The tweets are classified as either 0 (negative) or 1 (positive). The following is an example of one line of this data file:

95,0,Sentiment140, - Longest night ever.. ugh!    http://tumblr.com/xwp1yxhi6 

The first element represents a unique ID number which is part of the original data set and which we will use for the filename. The second element is the classification, the third is a data set label (effectively ignored for the purposes of this project), and the last element is the actual tweet text. Before we can use this data with our LingPipe model, we must write each tweet into an individual file. To do this, we created three string variables. The filename variable will be assigned either pos or neg depending on each tweet's classification and will be used in the write operation. We also use the file variable to hold the name of the individual tweet file and the text variable to hold the individual tweet text. Next, we use the readAllLines method with the Paths class's get method to store our data in a List object. We need to specify the charset, StandardCharsets.ISO_8859_1, as well:

try { 
    String filename; 
    String file; 
    String text; 
    List<String> lines = Files.readAllLines( 
Paths.get("\\path-to-file\\SentimentAnalysisDataset.csv"),  
StandardCharsets.ISO_8859_1); 
    ... 
 
} catch (IOException ex) { 
    // Handle exceptions 
} 

Now we can loop through our list and use the split method to store our .csv data in a string array. We limit the split to four fields so that any commas within the tweet text itself are preserved. We convert the element at position 1 to an integer and determine whether it is a 1. Tweets classified with a 1 are considered positive tweets and we set filename to pos. All other tweets set the filename to neg. We extract the output filename from the element at position 0 and the text from element 3. We ignore the label in position 2 for the purposes of this project. Finally, we write out our data:

for(String s : lines) { 
    // Limit the split to 4 fields so commas inside the tweet text survive 
    String[] oneLine = s.split(",", 4); 
    if(Integer.parseInt(oneLine[1])==1) { 
        filename = "pos"; 
    } else { 
        filename = "neg"; 
    } 
    file = oneLine[0]+".txt"; 
    text = oneLine[3]; 
    Files.write(Paths.get( 
        "\\path-to-file\\txt_sentoken\\"+filename+"\\"+file), 
        text.getBytes()); 
} 

Notice that we created the neg and pos directories within the txt_sentoken directory. This location is important when we read the files to build our model.
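
If these directories do not already exist, they can be created up front inside the same try block, since createDirectories also throws IOException. The paths below are placeholders, as before:

Files.createDirectories(Paths.get("\\path-to-file\\txt_sentoken\\pos")); 
Files.createDirectories(Paths.get("\\path-to-file\\txt_sentoken\\neg")); 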

Building the sentiment model

Now we are ready to build our model. We loop through the labels array, which contains pos and neg, and for each label we create a new Classification object. We then create a File object for the subdirectory corresponding to this label and use the listFiles method to create an array of filenames. Next, we will traverse these filenames using a for loop:

public void buildSentimentAnalysisModel() { 
    out.println("Building Sentiment Model"); 
     
    File trainingDir = new File("\\path to file\\txt_sentoken"); 
    for (int i = 0; i < labels.length; i++) { 
        Classification classification =  
            new Classification(labels[i]); 
        File file = new File(trainingDir, labels[i]); 
        File[] trainingFiles = file.listFiles(); 
        ... 
    } 
} 

Within the for loop, we extract the tweet data and store it in our string, review. We then create a new Classified object using review and classification. Finally, we call the handle method to train the classifier with this particular text:

for (int j = 0; j < trainingFiles.length; j++) { 
    try { 
        String review = Files.readFromFile(trainingFiles[j],  
            "ISO-8859-1"); 
        Classified<CharSequence> classified = new  
            Classified<>(review, classification); 
        classifier.handle(classified); 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 
 } 

For the dataset discussed in the previous section, this process may take a substantial amount of time. However, we consider this time trade-off to be worth the quality of analysis made possible by this training data.
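
Because training is slow, it may be worth compiling the trained model to disk once and reloading it on later runs. The following is a sketch using LingPipe's AbstractExternalizable utility; the file path is a placeholder, and the method names should be verified against your LingPipe version:

// Compile the trained classifier to a file (one-time cost); 
// wrap in try/catch for IOException and ClassNotFoundException 
AbstractExternalizable.compileTo(classifier, 
    new File("\\path-to-file\\sentimentModel.bin")); 
 
// Later runs can read the compiled model back instead of retraining 
LMClassifier<?, ?> compiled = (LMClassifier<?, ?>) 
    AbstractExternalizable.readObject( 
        new File("\\path-to-file\\sentimentModel.bin")); 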

Processing the JSON input

The Twitter data is retrieved in JSON format. We will use Twitter4J (http://twitter4j.org) to extract the relevant parts of the tweet and store them in the corresponding fields of the TweetHandler class.

The TweetHandler class's processJSON method does the actual data extraction. An instance of JSONObject is created from the JSON text. The class possesses several methods to extract specific types of data from an object. We use the getString method to get the fields we need.

The start of the processJSON method is shown next, where we start by obtaining the JSONObject instance, which we will use to extract the relevant parts of the tweet:

public TweetHandler processJSON() { 
    try { 
        JSONObject jsonObject = new JSONObject(this.jsonText); 
        ... 
    } catch (JSONException ex) { 
        // Handle exceptions 
    } 
    return this; 
} 

First, we extract the tweet's text as shown here:

this.text = jsonObject.getString("text"); 

Next, we extract the tweet's date. We use the SimpleDateFormat class to convert the date string to a Date object. Its constructor is passed a string that specifies the format of the date string. We used the string "EEE MMM d HH:mm:ss Z yyyy", whose parts are detailed next. The order of the string elements corresponds to the order found in the JSON entity:

  • EEE: Day of the week specified using three characters
  • MMM: Month, using three characters
  • d: Day of the month
  • HH:mm:ss: Hours, minutes, and seconds
  • Z: Time zone
  • yyyy: Year

The code follows:

SimpleDateFormat sdf = new SimpleDateFormat( 
    "EEE MMM d HH:mm:ss Z yyyy"); 
try { 
    this.date = sdf.parse(jsonObject.getString("created_at")); 
} catch (ParseException ex) { 
    // Handle exceptions 
} 
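
For reference, a created_at value from the stream matches this pattern. The following hypothetical example, placed inside the same try block, shows the parse in isolation:

// Hypothetical created_at value in the "EEE MMM d HH:mm:ss Z yyyy" format 
Date tweetDate = sdf.parse("Mon Oct 24 20:28:20 +0000 2016"); 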

The remaining fields are extracted as shown next. We had to obtain an intermediate JSON object to extract the name field:

this.language = jsonObject.getString("lang"); 
JSONObject user = jsonObject.getJSONObject("user"); 
this.userName = user.getString("name"); 

Having acquired and extracted the text, we are now ready to perform the important task of cleaning the data.

Cleaning data to improve our results

Data cleaning is a critical step in most data science problems. Data that is not properly cleaned may have errors such as misspellings, inconsistent representation of elements such as dates, and extraneous words.

There are numerous data cleaning options that we can apply to Twitter data. For this application, we perform simple cleaning. In addition, we will filter out certain tweets.

The conversion of the text to lowercase letters is easily achieved as shown here:

    public TweetHandler toLowerCase() { 
        this.text = this.text.toLowerCase().trim(); 
        return this; 
    } 

Part of the process is to remove certain tweets that are not needed. For example, the following code illustrates how to detect whether the tweet is in English and whether it contains a sub-topic of interest to the user. The boolean return value is used by the filter method in the Java 8 stream, which performs the actual removal:

    public boolean isEnglish() { 
        return this.language.equalsIgnoreCase("en"); 
    } 
     
    public boolean containsCharacter(String character) { 
        return this.text.contains(character); 
    } 

Numerous other cleaning operations can easily be added to the process, such as removing leading and trailing white space, replacing tabs, and validating dates and email addresses.
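
For example, tab replacement could be added as another fluent step on TweetHandler. The replaceTabs method below is a hypothetical addition written in the same style as toLowerCase:

// Hypothetical cleaning step, chainable like the other TweetHandler methods 
public TweetHandler replaceTabs() { 
    this.text = this.text.replace('\t', ' '); 
    return this; 
} 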

Removing stop words

Stop words are words that do not contribute to the understanding or processing of data. Typical stop words include the, and, a, and or. When they do not contribute to the data processing, they can be removed to simplify the processing and make it more efficient.

There are several techniques for removing stop words, as discussed in Chapter 9, Text Analysis. For this application, we will use LingPipe (http://alias-i.com/lingpipe/) to remove stop words. We use the EnglishStopTokenizerFactory class to obtain a model for our stop words based on an IndoEuropeanTokenizerFactory instance:

public TweetHandler removeStopWords() { 
    TokenizerFactory tokenizerFactory 
            = IndoEuropeanTokenizerFactory.INSTANCE; 
    tokenizerFactory =  
        new EnglishStopTokenizerFactory(tokenizerFactory); 
    ... 
    return this; 
} 

Tokens that are not stop words are then extracted, and a StringBuilder instance is used to create a string to replace the original text:

Tokenizer tokens = tokenizerFactory.tokenizer( 
        this.text.toCharArray(), 0, this.text.length()); 
StringBuilder buffer = new StringBuilder(); 
for (String word : tokens) { 
    buffer.append(word).append(' '); 
} 
this.text = buffer.toString(); 

The LingPipe model we used may not be the best suited for all tweets. In addition, it has been suggested that removing stop words from tweets may not be productive (http://oro.open.ac.uk/40666/). Options to select various stop words and whether stop words should even be removed can be added to the stream process.
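
A minimal sketch of a switchable stop-word step follows, assuming a hypothetical removeStops flag collected alongside the other user prompts:

boolean removeStops = ...; // Hypothetical user option 
 
stream 
        .map(s -> removeStops ? s.removeStopWords() : s) 
        ... 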

Performing sentiment analysis

We can now perform sentiment analysis using the model built in the Building the sentiment model section of this chapter. We create a new Classification object by passing our cleaned text to the classify method. We then use the bestCategory method to classify our text as either positive or negative. Finally, we set category to the result and return the TweetHandler object:

public TweetHandler performSentimentAnalysis() { 
    Classification classification =  
        classifier.classify(this.text); 
    String bestCategory = classification.bestCategory(); 
    this.category = bestCategory; 
    return this; 
} 
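
If a confidence value is needed in addition to the label, LingPipe's language-model classifiers can return a richer classification object. The following sketch assumes the JointClassification API; verify the method names against your LingPipe version:

// Sketch: classifyJoint exposes per-category probabilities 
JointClassification jc = classifier.classifyJoint(this.text); 
String best = jc.bestCategory(); 
double confidence = jc.conditionalProbability(0); // Probability of the top-ranked category 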

We are now ready to analyze the results of our application.

Analysing the results

The analysis performed in this application is fairly simple. Once the tweets have been classified as either positive or negative, a total is computed. We used two static variables for this purpose:

    private static int numberOfPositiveReviews = 0; 
    private static int numberOfNegativeReviews = 0; 

The computeStats method is called from the Java 8 stream and increments the appropriate variable:

public void computeStats() { 
    if(this.category.equalsIgnoreCase("pos")) { 
        numberOfPositiveReviews++; 
    } else { 
        numberOfNegativeReviews++; 
    } 
} 
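
These counters are plain static ints, which is fine for the sequential stream used here. If the stream were ever made parallel, atomic counters would be safer. A sketch under that assumption, using java.util.concurrent.atomic.AtomicInteger and a hypothetical field name:

private static final AtomicInteger positiveCount = new AtomicInteger(); 
... 
positiveCount.incrementAndGet(); 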

Two static methods provide access to the number of reviews:

public static int getNumberOfPositiveReviews() { 
    return numberOfPositiveReviews; 
} 
 
public static int getNumberOfNegativeReviews() { 
    return numberOfNegativeReviews; 
} 

In addition, a simple toString method is provided to display basic tweet information:

public String toString() { 
    return "\nText: " + this.text 
            + "\nDate: " + this.date 
            + "\nCategory: " + this.category; 
} 

More sophisticated analysis can be added as required. The intent of this application was to demonstrate a technique for combining the various data processing tasks.

Extracting data for a sentiment analysis model

In Chapter 9, Text Analysis, we used DL4J to perform sentiment analysis. We will use LingPipe in this example as an alternative to our previous approach. Because we want to classify Twitter data, we chose a dataset with pre-classified tweets, available at http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip. We must complete a one-time process of extracting this data into a format we can use with our model before we continue with our application development.

This dataset exists in a large .csv file with one tweet and classification per line. The tweets are classified as either 0 (negative) or 1 (positive). The following is an example of one line of this data file:

95,0,Sentiment140, - Longest night ever.. ugh!    http://tumblr.com/xwp1yxhi6 

The first element represents a unique ID number which is part of the original data set and which we will use for the filename. The second element is the classification, the third is a data set label (effectively ignored for the purposes of this project), and the last element is the actual tweet text. Before we can use this data with our LingPipe model, we must write each tweet into an individual file. To do this, we created three string variables. The filename variable will be assigned either pos or neg depending on each tweet's classification and will be used in the write operation. We also use the file variable to hold the name of the individual tweet file and the text variable to hold the individual tweet text. Next, we use the readAllLines method with the Paths class's get method to store our data in a List object. We need to specify the charset, StandardCharsets.ISO_8859_1, as well:

try { 
    String filename; 
    String file; 
    String text; 
    List<String> lines = Files.readAllLines( 
Paths.get("\\path-to-file\\SentimentAnalysisDataset.csv"),  
StandardCharsets.ISO_8859_1); 
    ... 
 
} catch (IOException ex) { 
    // Handle exceptions 
} 

Now we can loop through our list and use the split method to store our .csv data in a string array. We convert the element at position 1 to an integer and determine whether it is a 1. Tweets classified with a 1 are considered positive tweets and we set filename to pos. All other tweets set the filename to neg. We extract the output filename from the element at position 0 and the text from element 3. We ignore the label in position 2 for the purposes of this project. Finally, we write out our data:

for(String s : lines) { 
    String[] oneLine = s.split(","); 
    if(Integer.parseInt(oneLine[1])==1) { 
        filename = "pos"; 
    } else { 
        filename = "neg"; 
    } 
    file = oneLine[0]+".txt"; 
    text = oneLine[3]; 
    Files.write(Paths.get( 
        path-to-file\\txt_sentoken"+filename+""+file), 
        text.getBytes()); 
} 

Notice that we created the neg and pos directories within the txt_sentoken directory. This location is important when we read the files to build our model.

Building the sentiment model

Now we are ready to build our model. We loop through the labels array, which contains pos and neg, and for each label we create a new Classification object. We then create a new file using this label and use the listFiles method to create an array of filenames. Next, we will traverse these filenames using a for loop:

public void buildSentimentAnalysisModel() { 
    out.println("Building Sentiment Model"); 
     
    File trainingDir = new File("\\path to file\\txt_sentoken"); 
    for (int i = 0; i < labels.length; i++) { 
        Classification classification =  
            new Classification(labels[i]); 
        File file = new File(trainingDir, labels[i]); 
        File[] trainingFiles = file.listFiles(); 
        ... 
    } 
} 

Within the for loop, we extract the tweet data and store it in our string, review. We then create a new Classified object using review and classification. Finally we can call the handle method to classify this particular text:

for (int j = 0; j < trainingFiles.length; j++) { 
    try { 
        String review = Files.readFromFile(trainingFiles[j],  
            "ISO-8859-1"); 
        Classified<CharSequence> classified = new  
            Classified<>(review, classification); 
        classifier.handle(classified); 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 
 } 

For the dataset discussed in the previous section, this process may take a substantial amount of time. However, we consider this time trade-off to be worth the quality of analysis made possible by this training data.

Processing the JSON input

The Twitter data is retrieved using JSON format. We will use Twitter4J (http://twitter4j.org) to extract the relevant parts of the tweet and store in the corresponding field of the TweetHandler class.

The TweetHandler class's processJSON method does the actual data extraction. An instance of the JSONObject is created based on the JSON text. The class possesses several methods to extract specific types of data from an object. We use the getString method to get the fields we need.

The start of the processJSON method is shown next, where we start by obtaining the JSONObject instance, which we will use to extract the relevant parts of the tweet:

public TweetHandler processJSON() { 
    try { 
        JSONObject jsonObject = new JSONObject(this.jsonText); 
        ... 
    } catch (JSONException ex) { 
        // Handle exceptions 
    } 
    return this; 
} 

First, we extract the tweet's text as shown here:

this.text = jsonObject.getString("text"); 

Next, we extract the tweet's date. We use the SimpleDateFormat class to convert the date string to a Date object. Its constructor is passed a string that specifies the format of the date string. We used the string "EEE MMM d HH:mm:ss Z yyyy", whose parts are detailed next. The order of the string elements corresponds to the order found in the JSON entity:

  • EEE: Day of the week specified using three characters
  • MMM: Month, using three characters
  • d: Day of the month
  • HH:mm:ss: Hours, minutes, and seconds
  • Z: Time zone
  • yyyy: Year

The code follows:

SimpleDateFormat sdf = new SimpleDateFormat( 
    "EEE MMM d HH:mm:ss Z yyyy"); 
try { 
    this.date = sdf.parse(jsonObject.getString("created_at")); 
} catch (ParseException ex) { 
    // Handle exceptions 
} 

The remaining fields are extracted as shown next. We had to extract an intermediate JSON object to extract the name field:

this.language = jsonObject.getString("lang"); 
JSONObject user = jsonObject.getJSONObject("user"); 
this.userName = user.getString("name"); 

Having acquired and extracted the text, we are now ready to perform the important task of cleaning the data.

Cleaning data to improve our results

Data cleaning is a critical step in most data science problems. Data that is not properly cleaned may have errors such as misspellings, inconsistent representation of elements such as dates, and extraneous words.

There are numerous data cleaning options that we can apply to Twitter data. For this application, we perform simple cleaning. In addition, we will filter out certain tweets.

The conversion of the text to lowercase letters is easily achieved as shown here:

    public TweetHandler toLowerCase() { 
        this.text = this.text.toLowerCase().trim(); 
        return this; 
    } 

Part of the process is to remove certain tweets that are not needed. For example, the following code illustrates how to detect whether the tweet is in English and whether it contains a sub-topic of interest to the user. The boolean return value is used by the filter method in the Java 8 stream, which performs the actual removal:

    public boolean isEnglish() { 
        return this.language.equalsIgnoreCase("en"); 
    } 
     
    public boolean containsCharacter(String character) { 
        return this.text.contains(character); 
    } 

Numerous other cleaning operations can be easily added to the process such as removing leading and trailing white space, replacing tabs, and validating dates and email addresses.

Removing stop words

Stop words are those words that do not contribute to the understanding or processing of data. Typical stop words include the 0, and, a, and or. When they do not contribute to the data process, they can be removed to simplify processing and make it more efficient.

There are several techniques for removing stop words, as discussed in Chapter 9, Text Analysis. For this application, we will use LingPipe (http://alias-i.com/lingpipe/) to remove stop words. We use the EnglishStopTokenizerFactory class to obtain a model for our stop words based on an IndoEuropeanTokenizerFactory instance:

public TweetHandler removeStopWords() { 
    TokenizerFactory tokenizerFactory 
            = IndoEuropeanTokenizerFactory.INSTANCE; 
    tokenizerFactory =  
        new EnglishStopTokenizerFactory(tokenizerFactory); 
    ... 
    return this; 
} 

A series of tokens that do not contain stop words are extracted, and a StringBuilder instance is used to create a string to replace the original text:

Tokenizer tokens = tokenizerFactory.tokenizer( 
        this.text.toCharArray(), 0, this.text.length()); 
StringBuilder buffer = new StringBuilder(); 
for (String word : tokens) { 
    buffer.append(word + " "); 
} 
this.text = buffer.toString(); 

The LingPipe model we used may not be the best suited for all tweets. In addition, it has been suggested that removing stop words from tweets may not be productive (http://oro.open.ac.uk/40666/). Options to select various stop words and whether stop words should even be removed can be added to the stream process.

Performing sentiment analysis

We can now perform sentiment analysis using the model built in the Building the sentiment model section of this chapter. We create a new Classification object by passing our cleaned text to the classify method. We then use the bestCategory method to classify our text as either positive or negative. Finally, we set category to the result and return the TweetHandler object:

public TweetHandler performSentimentAnalysis() { 
    Classification classification =  
        classifier.classify(this.text); 
    String bestCategory = classification.bestCategory(); 
    this.category = bestCategory; 
    return this; 
} 

We are now ready to analyze the results of our application.

Analysing the results

The analysis performed in this application is fairly simple. Once the tweets have been classified as either positive or negative, a total is computed. We used two static variables for this purpose:

    private static int numberOfPositiveReviews = 0; 
    private static int numberOfNegativeReviews = 0; 

The computeStats method is called from the Java 8 stream and increments the appropriate variable:

public void computeStats() { 
    if(this.category.equalsIgnoreCase("pos")) { 
        numberOfPositiveReviews++; 
    } else { 
        numberOfNegativeReviews++; 
    } 
} 

Two static methods provide access to the number of reviews:

public static int getNumberOfPositiveReviews() { 
    return numberOfPositiveReviews; 
} 
 
public static int getNumberOfNegativeReviews() { 
    return numberOfNegativeReviews; 
} 

In addition, a simple toString method is provided to display basic tweet information:

public String toString() { 
    return "\nText: " + this.text 
            + "\nDate: " + this.date 
            + "\nCategory: " + this.category; 
} 

More sophisticated analysis can be added as required. The intent of this application was to demonstrate a technique for combining the various data processing tasks.

Building the sentiment model

Now we are ready to build our model. We loop through the labels array, which contains pos and neg, and for each label we create a new Classification object. We then create a new file using this label and use the listFiles method to create an array of filenames. Next, we will traverse these filenames using a for loop:

public void buildSentimentAnalysisModel() { 
    out.println("Building Sentiment Model"); 
     
    File trainingDir = new File("\\path to file\\txt_sentoken"); 
    for (int i = 0; i < labels.length; i++) { 
        Classification classification =  
            new Classification(labels[i]); 
        File file = new File(trainingDir, labels[i]); 
        File[] trainingFiles = file.listFiles(); 
        ... 
    } 
} 

Within the for loop, we extract the tweet data and store it in our string, review. We then create a new Classified object using review and classification. Finally we can call the handle method to classify this particular text:

for (int j = 0; j < trainingFiles.length; j++) { 
    try { 
        String review = Files.readFromFile(trainingFiles[j],  
            "ISO-8859-1"); 
        Classified<CharSequence> classified = new  
            Classified<>(review, classification); 
        classifier.handle(classified); 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 
 } 

For the dataset discussed in the previous section, this process may take a substantial amount of time. However, we consider this time trade-off to be worth the quality of analysis made possible by this training data.

Processing the JSON input

The Twitter data is retrieved using JSON format. We will use Twitter4J (http://twitter4j.org) to extract the relevant parts of the tweet and store in the corresponding field of the TweetHandler class.

The TweetHandler class's processJSON method does the actual data extraction. An instance of the JSONObject is created based on the JSON text. The class possesses several methods to extract specific types of data from an object. We use the getString method to get the fields we need.

The start of the processJSON method is shown next, where we start by obtaining the JSONObject instance, which we will use to extract the relevant parts of the tweet:

public TweetHandler processJSON() { 
    try { 
        JSONObject jsonObject = new JSONObject(this.jsonText); 
        ... 
    } catch (JSONException ex) { 
        // Handle exceptions 
    } 
    return this; 
} 

First, we extract the tweet's text as shown here:

this.text = jsonObject.getString("text"); 

Next, we extract the tweet's date. We use the SimpleDateFormat class to convert the date string to a Date object. Its constructor is passed a string that specifies the format of the date string. We used the string "EEE MMM d HH:mm:ss Z yyyy", whose parts are detailed next. The order of the string elements corresponds to the order found in the JSON entity:

  • EEE: Day of the week specified using three characters
  • MMM: Month, using three characters
  • d: Day of the month
  • HH:mm:ss: Hours, minutes, and seconds
  • Z: Time zone
  • yyyy: Year

The code follows:

SimpleDateFormat sdf = new SimpleDateFormat( 
    "EEE MMM d HH:mm:ss Z yyyy"); 
try { 
    this.date = sdf.parse(jsonObject.getString("created_at")); 
} catch (ParseException ex) { 
    // Handle exceptions 
} 

The remaining fields are extracted as shown next. We had to extract an intermediate JSON object to extract the name field:

this.language = jsonObject.getString("lang"); 
JSONObject user = jsonObject.getJSONObject("user"); 
this.userName = user.getString("name"); 

Having acquired and extracted the text, we are now ready to perform the important task of cleaning the data.

Cleaning data to improve our results

Data cleaning is a critical step in most data science problems. Data that is not properly cleaned may have errors such as misspellings, inconsistent representation of elements such as dates, and extraneous words.

There are numerous data cleaning options that we can apply to Twitter data. For this application, we perform simple cleaning. In addition, we will filter out certain tweets.

The conversion of the text to lowercase letters is easily achieved as shown here:

    public TweetHandler toLowerCase() { 
        this.text = this.text.toLowerCase().trim(); 
        return this; 
    } 

Part of the process is to remove certain tweets that are not needed. For example, the following code illustrates how to detect whether the tweet is in English and whether it contains a sub-topic of interest to the user. The boolean return value is used by the filter method in the Java 8 stream, which performs the actual removal:

    public boolean isEnglish() { 
        return this.language.equalsIgnoreCase("en"); 
    } 
     
    public boolean containsCharacter(String character) { 
        return this.text.contains(character); 
    } 

Numerous other cleaning operations can be easily added to the process such as removing leading and trailing white space, replacing tabs, and validating dates and email addresses.

Removing stop words

Stop words are those words that do not contribute to the understanding or processing of data. Typical stop words include the 0, and, a, and or. When they do not contribute to the data process, they can be removed to simplify processing and make it more efficient.

There are several techniques for removing stop words, as discussed in Chapter 9, Text Analysis. For this application, we will use LingPipe (http://alias-i.com/lingpipe/) to remove stop words. We use the EnglishStopTokenizerFactory class to obtain a model for our stop words based on an IndoEuropeanTokenizerFactory instance:

public TweetHandler removeStopWords() { 
    TokenizerFactory tokenizerFactory 
            = IndoEuropeanTokenizerFactory.INSTANCE; 
    tokenizerFactory =  
        new EnglishStopTokenizerFactory(tokenizerFactory); 
    ... 
    return this; 
} 

A series of tokens that do not contain stop words are extracted, and a StringBuilder instance is used to create a string to replace the original text:

Tokenizer tokens = tokenizerFactory.tokenizer( 
        this.text.toCharArray(), 0, this.text.length()); 
StringBuilder buffer = new StringBuilder(); 
for (String word : tokens) { 
    buffer.append(word + " "); 
} 
this.text = buffer.toString(); 

The LingPipe model we used may not be the best suited for all tweets. In addition, it has been suggested that removing stop words from tweets may not be productive (http://oro.open.ac.uk/40666/). Options to select various stop words and whether stop words should even be removed can be added to the stream process.

Performing sentiment analysis

We can now perform sentiment analysis using the model built in the Building the sentiment model section of this chapter. We create a new Classification object by passing our cleaned text to the classify method. We then use the bestCategory method to classify our text as either positive or negative. Finally, we set category to the result and return the TweetHandler object:

public TweetHandler performSentimentAnalysis() { 
    Classification classification =  
        classifier.classify(this.text); 
    String bestCategory = classification.bestCategory(); 
    this.category = bestCategory; 
    return this; 
} 

We are now ready to analyze the results of our application.

Analysing the results

The analysis performed in this application is fairly simple. Once the tweets have been classified as either positive or negative, a total is computed. We used two static variables for this purpose:

    private static int numberOfPositiveReviews = 0; 
    private static int numberOfNegativeReviews = 0; 

The computeStats method is called from the Java 8 stream and increments the appropriate variable:

public void computeStats() { 
    if(this.category.equalsIgnoreCase("pos")) { 
        numberOfPositiveReviews++; 
    } else { 
        numberOfNegativeReviews++; 
    } 
} 

Two static methods provide access to the number of reviews:

public static int getNumberOfPositiveReviews() { 
    return numberOfPositiveReviews; 
} 
 
public static int getNumberOfNegativeReviews() { 
    return numberOfNegativeReviews; 
} 

In addition, a simple toString method is provided to display basic tweet information:

public String toString() { 
    return "\nText: " + this.text 
            + "\nDate: " + this.date 
            + "\nCategory: " + this.category; 
} 

More sophisticated analysis can be added as required. The intent of this application was to demonstrate a technique for combining the various data processing tasks.

Processing the JSON input

The Twitter data is retrieved using JSON format. We will use Twitter4J (http://twitter4j.org) to extract the relevant parts of the tweet and store in the corresponding field of the TweetHandler class.

The TweetHandler class's processJSON method does the actual data extraction. An instance of the JSONObject is created based on the JSON text. The class possesses several methods to extract specific types of data from an object. We use the getString method to get the fields we need.

The start of the processJSON method is shown next, where we start by obtaining the JSONObject instance, which we will use to extract the relevant parts of the tweet:

public TweetHandler processJSON() { 
    try { 
        JSONObject jsonObject = new JSONObject(this.jsonText); 
        ... 
    } catch (JSONException ex) { 
        // Handle exceptions 
    } 
    return this; 
} 

First, we extract the tweet's text as shown here:

this.text = jsonObject.getString("text"); 

Next, we extract the tweet's date. We use the SimpleDateFormat class to convert the date string to a Date object. Its constructor is passed a string that specifies the format of the date string. We used the string "EEE MMM d HH:mm:ss Z yyyy", whose parts are detailed next. The order of the string elements corresponds to the order found in the JSON entity:

  • EEE: Day of the week specified using three characters
  • MMM: Month, using three characters
  • d: Day of the month
  • HH:mm:ss: Hours, minutes, and seconds
  • Z: Time zone
  • yyyy: Year

The code follows:

SimpleDateFormat sdf = new SimpleDateFormat( 
    "EEE MMM d HH:mm:ss Z yyyy"); 
try { 
    this.date = sdf.parse(jsonObject.getString("created_at")); 
} catch (ParseException ex) { 
    // Handle exceptions 
} 

The remaining fields are extracted as shown next. We had to extract an intermediate JSON object to extract the name field:

this.language = jsonObject.getString("lang"); 
JSONObject user = jsonObject.getJSONObject("user"); 
this.userName = user.getString("name"); 

Having acquired and extracted the text, we are now ready to perform the important task of cleaning the data.

Cleaning data to improve our results

Data cleaning is a critical step in most data science problems. Data that is not properly cleaned may have errors such as misspellings, inconsistent representation of elements such as dates, and extraneous words.

There are numerous data cleaning options that we can apply to Twitter data. For this application, we perform simple cleaning. In addition, we will filter out certain tweets.

The conversion of the text to lowercase letters is easily achieved as shown here:

    public TweetHandler toLowerCase() { 
        this.text = this.text.toLowerCase().trim(); 
        return this; 
    } 

Cleaning data to improve our results

Data cleaning is a critical step in most data science problems. Data that has not been properly cleaned may contain errors such as misspellings, inconsistently represented elements such as dates, and extraneous words.

There are numerous data cleaning options that we can apply to Twitter data. For this application, we perform simple cleaning. In addition, we will filter out certain tweets.

Converting the text to lowercase, and trimming leading and trailing white space at the same time, is easily achieved, as shown here:

    public TweetHandler toLowerCase() {
        // Normalize case and strip leading/trailing white space
        this.text = this.text.toLowerCase().trim();
        return this;
    }

Part of the cleaning process is to remove certain tweets that are not needed. For example, the following methods detect whether a tweet is in English and whether it contains a sub-topic of interest to the user. Their boolean return values are used by the filter method of the Java 8 stream, which performs the actual removal:

    public boolean isEnglish() {
        // True when the tweet's language field is tagged as English
        return this.language.equalsIgnoreCase("en");
    }

    public boolean containsCharacter(String character) {
        // Despite its name, this tests for a substring, such as a sub-topic
        return this.text.contains(character);
    }
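
A minimal sketch of how these predicates might plug into such a stream is shown below; the tweets list and the subTopic variable are illustrative names, not part of the code above:

// A minimal sketch, assuming a java.util.List<TweetHandler> named tweets
// and a user-supplied subTopic string; both names are illustrative only.
tweets.stream()
        .filter(TweetHandler::isEnglish)
        .filter(tweet -> tweet.containsCharacter(subTopic))
        .forEach(System.out::println);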

Numerous other cleaning operations can easily be added to the process, such as normalizing internal white space, replacing tabs, and validating dates and email addresses; one such operation is sketched below.
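
As a hedged illustration of one such addition, white space normalization might be written as another fluent method; the method name below is our own invention:

// Illustrative only: this method name is an assumption and not part
// of the original TweetHandler class.
public TweetHandler removeExtraWhitespace() {
    // Collapse tabs and runs of white space into single spaces
    this.text = this.text.replaceAll("\\s+", " ").trim();
    return this;
}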

Removing stop words

Stop words are words that do not contribute to the understanding or processing of the data. Typical stop words include "the", "a", "and", and "or". When they do not contribute to the data processing, they can be removed to simplify it and make it more efficient.

There are several techniques for removing stop words, as discussed in Chapter 9, Text Analysis. For this application, we will use LingPipe (http://alias-i.com/lingpipe/) to remove stop words. We use the EnglishStopTokenizerFactory class, wrapped around an IndoEuropeanTokenizerFactory instance, to obtain a tokenizer factory that drops English stop words:

public TweetHandler removeStopWords() { 
    TokenizerFactory tokenizerFactory 
            = IndoEuropeanTokenizerFactory.INSTANCE; 
    tokenizerFactory =  
        new EnglishStopTokenizerFactory(tokenizerFactory); 
    ... 
    return this; 
} 

A stream of tokens with the stop words already removed is then extracted, and a StringBuilder instance is used to build the string that replaces the original text:

// Tokenize the text; the wrapping factory silently drops stop words
Tokenizer tokens = tokenizerFactory.tokenizer(
        this.text.toCharArray(), 0, this.text.length());
// Rebuild the text from the remaining tokens
StringBuilder buffer = new StringBuilder();
for (String word : tokens) {
    buffer.append(word).append(" ");
}
this.text = buffer.toString();

The LingPipe tokenizer we used may not be best suited to all tweets. In addition, it has been suggested that removing stop words from tweets may not be productive (http://oro.open.ac.uk/40666/). Options to select different stop-word lists, or to skip stop-word removal altogether, can be added to the stream process.
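
If finer control over the stop list is needed, a custom set of terms can be supplied instead. The following is a minimal sketch under the assumption that LingPipe's StopTokenizerFactory, which EnglishStopTokenizerFactory extends, is used directly:

// A minimal sketch, assuming LingPipe's
// com.aliasi.tokenizer.StopTokenizerFactory (the class that
// EnglishStopTokenizerFactory extends); the stop terms are illustrative.
Set<String> stopSet = new HashSet<>(Arrays.asList("rt", "via", "amp"));
TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
factory = new StopTokenizerFactory(factory, stopSet);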

Performing sentiment analysis

We can now perform sentiment analysis using the model built in the Building the sentiment model section of this chapter. We create a Classification object by passing our cleaned text to the classify method, and then use its bestCategory method to label the text as either positive or negative. Finally, we set the category field to the result and return the TweetHandler object so that the fluent chain can continue:

public TweetHandler performSentimentAnalysis() {
    // Classify the cleaned text and record the winning category
    Classification classification =
        classifier.classify(this.text);
    this.category = classification.bestCategory();
    return this;
}
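
The classifier field referenced here comes from the Building the sentiment model section. As a hedged sketch, if that model had been serialized to disk, it could be read back along these lines; the file name and the use of a raw LMClassifier are assumptions:

// A minimal sketch, assuming the trained model was compiled to a file
// named "classifier.bin" (an illustrative name) in the model-building
// step; readObject is com.aliasi.util.AbstractExternalizable and throws
// IOException and ClassNotFoundException, omitted here for brevity.
LMClassifier classifier = (LMClassifier) AbstractExternalizable
        .readObject(new File("classifier.bin"));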

We are now ready to analyze the results of our application.

Analysing the results

The analysis performed in this application is fairly simple. Once the tweets have been classified as either positive or negative, totals are computed. We use two static variables for this purpose:

    private static int numberOfPositiveReviews = 0; 
    private static int numberOfNegativeReviews = 0; 

The computeStats method is called from the Java 8 stream and increments the appropriate variable:

public void computeStats() {
    // Note: these static counters assume the stream is sequential;
    // a parallel stream would need atomic counters instead
    if (this.category.equalsIgnoreCase("pos")) {
        numberOfPositiveReviews++;
    } else {
        numberOfNegativeReviews++;
    }
}

Two static methods provide access to these counts:

public static int getNumberOfPositiveReviews() { 
    return numberOfPositiveReviews; 
} 
 
public static int getNumberOfNegativeReviews() { 
    return numberOfNegativeReviews; 
} 
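
A short usage sketch of these accessors, printing the totals with a simple percentage breakdown (the output format is our own), might look as follows:

// A minimal usage sketch; the output wording is illustrative only.
int positive = TweetHandler.getNumberOfPositiveReviews();
int negative = TweetHandler.getNumberOfNegativeReviews();
int total = positive + negative;
if (total > 0) {
    System.out.printf("Positive: %d (%.1f%%)%n",
            positive, 100.0 * positive / total);
    System.out.printf("Negative: %d (%.1f%%)%n",
            negative, 100.0 * negative / total);
}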

In addition, a simple toString method is provided to display basic tweet information:

@Override
public String toString() {
    return "\nText: " + this.text
            + "\nDate: " + this.date
            + "\nCategory: " + this.category;
}

More sophisticated analysis can be added as required. The intent of this application was to demonstrate a technique for combining the various data processing tasks.

Other optional enhancements

There are numerous improvements that can be made to the application. Many of these are user preferences, and others relate to improving the application's results. A GUI would be useful in many situations. Among the user options, we may want to add support for:

  • Displaying individual tweets
  • Allowing null sub-topics
  • Processing other tweet fields
  • Providing a list of topics or sub-topics the user can choose from
  • Generating additional statistics and supporting charts

To improve the results of the processing, the following should be considered:

  • Correcting user entries for misspellings
  • Removing the spacing around punctuation (see the sketch after this list)
  • Using alternative stop-word-removal techniques
  • Using alternative sentiment analysis techniques
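
Regarding the punctuation item above: because removeStopWords rejoins tokens with single spaces, punctuation ends up preceded by a space. A minimal sketch of closing that gap with an illustrative regular expression follows:

// A minimal sketch; the regular expression is illustrative only.
// It turns "great , but slow !" into "great, but slow!".
this.text = this.text.replaceAll("\\s+([.,!?;:])", "$1");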

The details of many of these enhancements depend on the GUI used and on the purpose and scope of the application.

Summary

The intent of this chapter was to illustrate how various data science tasks can be integrated into an application. We chose an application that processes tweets because it is a popular social medium and allows us to apply many of the techniques discussed in earlier chapters.

A simple console-based interface was used to avoid cluttering the discussion with specific but possibly irrelevant GUI details. The application prompted the user for a Twitter topic, a sub-topic, and the number of tweets to process. The analysis consisted of determining the sentiments of the tweets, with simple statistics regarding the positive or negative nature of the tweets.

The first step in the process was to build a sentiment model. We used LingPipe classes to build the model and perform the analysis. A Java 8 stream supported a fluent style of programming in which individual processing steps could easily be added or removed.

Once the stream was created, the raw JSON text was processed and used to initialize a TweetHandler instance. These instances were subsequently modified, including converting the text to lowercase, removing non-English tweets, removing stop words, and selecting only those tweets that contain the sub-topic. Sentiment analysis was then performed, followed by the computation of the statistics.
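
As a closing illustration, the fluent chain described above might be assembled along these lines; this is a sketch of the approach rather than the book's exact code:

// A minimal sketch, assuming a java.util.List<TweetHandler> named
// tweets built during data acquisition and a user-supplied subTopic;
// the names and the exact step order are illustrative only.
tweets.stream()
        .map(TweetHandler::toLowerCase)
        .filter(TweetHandler::isEnglish)
        .filter(tweet -> tweet.containsCharacter(subTopic))
        .map(TweetHandler::removeStopWords)
        .map(TweetHandler::performSentimentAnalysis)
        .forEach(TweetHandler::computeStats);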

Data science is a broad topic that utilizes a wide range of statistical and computer science topics. In this book, we provided a brief introduction to many of these topics and how they are supported by Java.