Java Data Analysis

By: John R. Hubbard

Overview of this book

Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the aim of discovering useful information. Java is one of the most popular languages for data analysis tasks. This book will help you learn the Java tools and techniques for conducting data analysis. After a quick overview of what data science is and the steps involved in the process, you’ll learn statistical data analysis techniques and implement them using popular Java APIs and libraries. Through practical examples, you will also learn machine learning concepts such as classification and regression. In the process, you’ll familiarize yourself with tools such as RapidMiner and Weka and see how these Java-based tools can be used effectively for analysis. You will also learn how to analyze text and other types of multimedia, and how to work with relational, NoSQL, and time-series data. This book will also show you how to use different Java-based libraries to create insightful, easy-to-understand plots and graphs. By the end of this book, you will have a solid understanding of the various data analysis techniques and how to implement them using Java.
Table of Contents (20 chapters)
Java Data Analysis
Credits
About the Author
About the Reviewers
www.PacktPub.com
Customer Feedback
Preface
Index

Descriptive statistics


A descriptive statistic is a function that computes a numeric value which in some way summarizes the data in a numeric dataset.

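We saw two statistics in Chapter 3, Data Visualization: the sample mean, $\bar{x}$, and the sample standard deviation, $s$. Their formulas are:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$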

The mean summarizes the central tendency of the dataset. It is also called the simple average or mean average. The standard deviation is a measure of the dispersion of the dataset. Its square, $s^2$, is called the sample variance.
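The two statistics can be sketched in plain Java as follows. This is a minimal illustration, not code from the book: the class name `SampleStats` and its method names are hypothetical, the dataset is assumed to be held in a `double[]` array, and the variance is assumed to use the usual n - 1 denominator for a sample.

```java
class SampleStats {
    /** Sample mean: the sum of the values divided by their count n. */
    static double mean(double[] x) {
        if (x.length == 0) {
            throw new IllegalArgumentException("dataset must not be empty");
        }
        double sum = 0.0;
        for (double xi : x) {
            sum += xi;
        }
        return sum / x.length;
    }

    /** Sample variance s^2, assuming the n - 1 denominator. */
    static double variance(double[] x) {
        if (x.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double m = mean(x);
        double sumOfSquares = 0.0;
        for (double xi : x) {
            sumOfSquares += (xi - m) * (xi - m);
        }
        return sumOfSquares / (x.length - 1);
    }

    /** Sample standard deviation s: the square root of the sample variance. */
    static double standardDeviation(double[] x) {
        return Math.sqrt(variance(x));
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9}; // made-up values for illustration
        System.out.println("mean = " + mean(data));
        System.out.println("variance = " + variance(data));
        System.out.println("standard deviation = " + standardDeviation(data));
    }
}
```

For the made-up data above, the values sum to 40, so the mean is 5, and the squared deviations sum to 32, so the sample variance is 32/7.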

The maximum of a dataset is its greatest value, the minimum is its least value, and the range is their difference.
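These three statistics could be computed with a single pass each over the data. The sketch below is illustrative only: the class name `RangeStats` and its methods are hypothetical, and the dataset is assumed to be a non-empty `double[]` array.

```java
class RangeStats {
    /** Greatest value in the dataset. */
    static double max(double[] x) {
        requireNonEmpty(x);
        double best = x[0];
        for (double xi : x) {
            if (xi > best) {
                best = xi;
            }
        }
        return best;
    }

    /** Least value in the dataset. */
    static double min(double[] x) {
        requireNonEmpty(x);
        double best = x[0];
        for (double xi : x) {
            if (xi < best) {
                best = xi;
            }
        }
        return best;
    }

    /** Range: the maximum minus the minimum. */
    static double range(double[] x) {
        return max(x) - min(x);
    }

    private static void requireNonEmpty(double[] x) {
        if (x.length == 0) {
            throw new IllegalArgumentException("dataset must not be empty");
        }
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9}; // made-up values for illustration
        System.out.println("max = " + max(data));
        System.out.println("min = " + min(data));
        System.out.println("range = " + range(data));
    }
}
```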

If $w = (w_1, w_2, \ldots, w_n)$ is a vector with the same number of components as the dataset, then we can use it to define the weighted mean:

$$\sum_{i=1}^{n} w_i x_i = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$

In linear algebra, this expression is called the inner product of the two vectors, $w$ and $x = (x_1, x_2, \ldots, x_n)$. Note that if we choose all the weights to be $1/n$, then the resulting weighted mean is just the sample mean.
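The weighted mean, read as an inner product, can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions: the class name `WeightedMean` and its methods are hypothetical, and the weights are taken as given, without checking that they sum to 1.

```java
class WeightedMean {
    /** Inner product of w and x: w1*x1 + w2*x2 + ... + wn*xn. */
    static double weightedMean(double[] w, double[] x) {
        if (w.length != x.length) {
            throw new IllegalArgumentException("w and x must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += w[i] * x[i];
        }
        return sum;
    }

    /** Uniform weights 1/n, which reduce the weighted mean to the sample mean. */
    static double[] uniformWeights(int n) {
        double[] w = new double[n];
        java.util.Arrays.fill(w, 1.0 / n);
        return w;
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9}; // made-up values for illustration
        double[] w = uniformWeights(data.length);
        // With every weight equal to 1/n, this is the sample mean of the data.
        System.out.println("weighted mean = " + weightedMean(w, data));
    }
}
```

With the uniform weights shown in `main`, the result matches the sample mean of the same data, which is 5 for these made-up values.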

The median of a dataset...