## Exploratory data analysis in Java

Exploratory data analysis (EDA) is about taking a dataset and extracting the most important information from it, so that we can get an idea of what the data looks like. It has two main parts: summarization and visualization.

The summarization step is very helpful for understanding the data. For numerical variables, this step consists of calculating the most important sample statistics:

- The extremes (the minimal and the maximal values)
- The mean value, or the sample average
- The standard deviation, which describes the spread of the data

Often we also consider other statistics, such as the median and the first and third quartiles (the 25th and 75th percentiles).
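To make these statistics concrete, here is a minimal sketch that computes them with the Java standard library only. The class name `SummaryStats`, its helper methods, and the sample values are hypothetical and chosen purely for illustration. The standard deviation uses the sample formula (dividing by n - 1), and the percentiles use linear interpolation between neighboring ranks, which is only one of several common conventions, so other tools may report slightly different quartiles.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;

public class SummaryStats {

    // Sample standard deviation: square root of the sum of squared
    // deviations from the mean, divided by (n - 1).
    public static double standardDeviation(double[] values) {
        double mean = Arrays.stream(values).average().orElse(Double.NaN);
        double sumOfSquares = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean))
                .sum();
        return Math.sqrt(sumOfSquares / (values.length - 1));
    }

    // Percentile (p in [0, 100]) by linear interpolation between the two
    // closest ranks of a sorted copy. Assumes a non-empty array.
    public static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double rank = p / 100.0 * (sorted.length - 1);
        int lower = (int) Math.floor(rank);
        int upper = (int) Math.ceil(rank);
        double weight = rank - lower;
        return sorted[lower] * (1 - weight) + sorted[upper] * weight;
    }

    public static void main(String[] args) {
        // Hypothetical values of one numerical variable
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9};

        // Min, max, and mean come from a single pass over the data
        DoubleSummaryStatistics stats = Arrays.stream(values).summaryStatistics();

        System.out.printf("min=%.2f max=%.2f mean=%.2f%n",
                stats.getMin(), stats.getMax(), stats.getAverage());
        System.out.printf("std=%.3f%n", standardDeviation(values));
        System.out.printf("25%%=%.2f median=%.2f 75%%=%.2f%n",
                percentile(values, 25), percentile(values, 50), percentile(values, 75));
    }
}
```

`DoubleSummaryStatistics` covers the extremes and the mean in one pass, while the standard deviation and the percentiles need the helpers because the stream API does not provide them directly.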

As we have already seen in the previous chapter, Java offers a great set of tools for data preparation. The same set of tools can be used for EDA, and especially for creating summaries.

### Search engine datasets

In this chapter, we will use our running example: building a search engine. In Chapter 2, *Data Processing Toolbox*, we extracted...