Java: Data Science Made Easy

By: Richard M. Reese, Jennifer L. Reese, Alexey Grigorev

Overview of this book

Data science is concerned with extracting knowledge and insights from a wide variety of data sources to analyse patterns or predict future behaviour. It draws from a wide array of disciplines including statistics, computer science, mathematics, machine learning, and data mining. In this course, we cover the basic as well as advanced data science concepts and how they are implemented using the popular Java tools and libraries.

The course starts with an introduction of data science, followed by the basic data science tasks of data collection, data cleaning, data analysis, and data visualization. This is followed by a discussion of statistical techniques and more advanced topics including machine learning, neural networks, and deep learning. You will examine the major categories of data analysis including text, visual, and audio data, followed by a discussion of resources that support parallel implementation. Throughout this course, the chapters will illustrate a challenging data science problem, and then go on to present a comprehensive, Java-based solution to tackle that problem. You will cover a wide range of topics – from classification and regression, to dimensionality reduction and clustering, deep learning and working with Big Data. Finally, you will see the different ways to deploy the model and evaluate it in production settings. By the end of this course, you will be up and running with various facets of data science using Java, in no time at all.

This course contains premium content from two of our recently published popular titles:
- Java for Data Science
- Mastering Java for Data Science

Chapter 4. Data Cleaning

Real-world data is frequently dirty and unstructured, and must be reworked before it is usable. Data may contain errors, have duplicate entries, exist in the wrong format, or be inconsistent. The process of addressing these types of issues is called data cleaning. Data cleaning is also referred to as data wrangling, massaging, reshaping, or munging. Data merging, where data from multiple sources is combined, is often considered to be a data cleaning activity.
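
As a rough illustration of these ideas, the following sketch merges two small lists, skips missing entries, normalizes inconsistent formatting, and drops duplicates. The class name `MergeAndDeduplicate`, its methods, and the sample values are hypothetical and chosen only for this example; real cleaning rules depend on the data at hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, written only to illustrate merging and deduplication.
class MergeAndDeduplicate {

    // Bring a raw value into one format: trim the ends,
    // collapse runs of whitespace, and lower-case the text.
    static String normalize(String raw) {
        return raw.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Add the usable entries of one source to the target set.
    // Null and blank entries are skipped; the set ignores duplicates.
    private static void addCleaned(Set<String> target, List<String> source) {
        for (String raw : source) {
            if (raw == null || raw.trim().isEmpty()) {
                continue;
            }
            target.add(normalize(raw));
        }
    }

    // Merge two sources into one list without duplicates,
    // keeping the order in which values were first seen.
    static List<String> mergeClean(List<String> sourceA, List<String> sourceB) {
        Set<String> seen = new LinkedHashSet<>();
        addCleaned(seen, sourceA);
        addCleaned(seen, sourceB);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Sample data is made up for the example.
        List<String> first = Arrays.asList("  New York", "Boston", null);
        List<String> second = Arrays.asList("new  york", "", "Chicago");
        System.out.println(mergeClean(first, second));
    }
}
```

A `LinkedHashSet` is used here so that duplicates are removed while the original ordering is preserved; lower-casing is one possible normalization and may be too aggressive for data where case carries meaning.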

We need to clean data because any analysis based on inaccurate data can produce misleading results. We want to ensure that the data we work with is of high quality. Data quality involves:

Validity: Ensuring that the data possesses the correct form or structure
Accuracy: The values within the data are truly representative of the dataset
Completeness: There are no missing elements
Consistency: Changes to data are in sync
Uniformity: The same units of measurement are used
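
Some of these properties can be checked mechanically. The sketch below shows one possible check each for completeness, validity, and uniformity; accuracy and consistency usually require knowledge of the data's origin and are not covered. The class `QualityChecks`, the simplified email pattern, and the choice of centimeters as the common unit are assumptions made for this example only.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating simple data quality checks.
class QualityChecks {

    // Deliberately simple shape check (something@something.something);
    // an assumption for the example, not a complete email validator.
    private static final Pattern EMAIL =
            Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");

    // A number followed by a unit, either "cm" or "in".
    private static final Pattern MEASUREMENT =
            Pattern.compile("^\\s*([0-9]+(?:\\.[0-9]+)?)\\s*(cm|in)\\s*$");

    // Completeness: every field of the row is present and non-blank.
    static boolean isComplete(String[] row) {
        for (String field : row) {
            if (field == null || field.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Validity: the value has the expected form.
    static boolean isValidEmail(String value) {
        return value != null && EMAIL.matcher(value).matches();
    }

    // Uniformity: express every length in centimeters.
    static double toCentimeters(String measurement) {
        Matcher m = MEASUREMENT.matcher(measurement.toLowerCase(Locale.ROOT));
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized measurement: " + measurement);
        }
        double value = Double.parseDouble(m.group(1));
        return m.group(2).equals("in") ? value * 2.54 : value;
    }

    public static void main(String[] args) {
        // The sample row is made up for the example.
        String[] row = {"42", "reader@example.com", "70 in"};
        System.out.println("complete: " + isComplete(row));
        System.out.println("valid email: " + isValidEmail(row[1]));
        System.out.println("height in cm: " + toCentimeters(row[2]));
    }
}
```

Rows that fail a check can then be repaired, flagged for review, or discarded, depending on how much data can be lost without affecting the analysis.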

There are several techniques and tools used to...