Visualization is a powerful way to communicate information through graphical means. Visual presentations make data easier to comprehend. This recipe presents some basic functions to plot charts, and demonstrates how visualizations are helpful in data exploration.
Ensure that you have completed the previous recipes by installing R on your operating system.
Perform the following steps to visualize a dataset:
- Load the iris data into the R session:
> data(iris)
- Calculate the frequency of species within the iris data using the table command:
> table.iris = table(iris$Species)
> table.iris
Output:
    setosa versicolor  virginica
        50         50         50
- As the frequency in the table shows, each species represents 1/3 of the iris data. We can draw a simple pie chart to represent the distribution of species within the iris data:
> pie(table.iris)
The pie chart of species distribution
- A histogram shows how often values fall into consecutive intervals (bins) laid out along the x-axis. The following example produces a histogram of the sepal length:
> hist(iris$Sepal.Length)
The histogram of the sepal length
- In the histogram, the x-axis presents the sepal length and the y-axis presents the number of flowers in each interval. The histogram shows that all sepal lengths in the iris data fall between 4 cm and 8 cm.
- Boxplots, also called box-and-whisker plots, convey a lot of information in one simple plot. In such a graph, the line inside the box represents the median of the sample. The box itself spans the lower and upper quartiles. The whiskers show the range of the data, excluding any points that R draws separately as outliers:
> boxplot(Petal.Width ~ Species, data = iris)
The boxplot of the petal width
- The preceding screenshot clearly shows that the median and upper range of the petal width of setosa are much smaller than those of versicolor and virginica. Therefore, the petal width can be used as a substantial attribute to distinguish iris species.
- A scatter plot is used when there are two variables to plot against one another. This example plots the petal length against the petal width and colors the dots according to the species they belong to:
> plot(x=iris$Petal.Length, y=iris$Petal.Width, col=iris$Species)
The scatter plot of the petal length against the petal width
- The preceding screenshot is a scatter plot of the petal length against the petal width. As there are four numeric attributes within the iris dataset, it takes six separate plots to cover every pair of attributes. However, R provides a function named pairs, which can generate each subplot in one figure:
> pairs(iris[1:4], main = "Edgar Anderson's Iris Data", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
Pairs scatterplot of iris data
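The readings taken from the plots in the preceding steps can be cross-checked numerically. The following is a minimal sketch that uses only base R functions; the variable names (species.share, sepal.range, and petal.medians) are illustrative and are not part of the recipe:

```r
data(iris)

# Share of each species: each should be 1/3 of the 150 observations
species.share = prop.table(table(iris$Species))
print(species.share)

# Smallest and largest sepal length, to compare with the histogram's x-axis
sepal.range = range(iris$Sepal.Length)
print(sepal.range)

# Median petal width per species, to compare with the lines inside the boxes
petal.medians = tapply(iris$Petal.Width, iris$Species, median)
print(petal.medians)

# Number of distinct pairs among the four numeric attributes
print(choose(4, 2))
```

If the printed values disagree with what you read from a chart, it is usually worth rechecking the axis scales before drawing conclusions from the picture.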
R provides many built-in plot functions, which enable users to visualize data with different kinds of plots. This recipe demonstrates the use of a pie chart to present the distribution of a category; slices of equal size show that each species has the same number of observations. A histogram plots the frequency of different sepal lengths. A box plot conveys a great deal of descriptive statistics, and shows that the petal width can be used to distinguish iris species. Lastly, we introduced scatter plots, which plot two variables against each other on a single plot. To quickly generate scatter plots for all pairs of attributes in the iris dataset, one may use the pairs command.
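The summary above can be made concrete by inspecting what the plotting functions compute before they draw. The following sketch assumes the base R behavior in which hist and boxplot accept plot = FALSE and return the underlying statistics; the variable names h and b and the row labels are illustrative choices, not part of the recipe:

```r
data(iris)

# hist() returns the bins it would draw; plot = FALSE skips the drawing
h = hist(iris$Sepal.Length, plot = FALSE)
print(h$breaks)  # bin edges along the x-axis
print(h$counts)  # number of flowers in each bin

# boxplot() returns five summary values per group in $stats
b = boxplot(Petal.Width ~ Species, data = iris, plot = FALSE)

# Label the rows and columns so that the matrix is easier to read
rownames(b$stats) = c("lower whisker", "lower hinge", "median",
                      "upper hinge", "upper whisker")
colnames(b$stats) = b$names
print(b$stats)
```

The median row of b$stats corresponds to the line inside each box, and the two hinge rows correspond to the edges of the box.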
ggplot2 is another plotting system for R, based on Leland Wilkinson's grammar of graphics. It allows users to add, remove, or alter components in a plot at a higher level of abstraction. However, this level of abstraction makes it slower than lattice graphics. If you are interested in ggplot2, you can refer to this site: http://ggplot2.org/.
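As a rough illustration of building a plot from components, the following sketch redraws the petal scatter plot from this recipe with ggplot2. It assumes that the ggplot2 package has already been installed (for example, with install.packages("ggplot2")); the variable name p and the axis label text are illustrative:

```r
# Assumption: ggplot2 is installed; the block does nothing otherwise
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  data(iris)

  # Map petal length and width to the axes and species to the point color
  p = ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
    geom_point()

  # Components are added with "+": here, axis labels on top of the point layer
  p = p + labs(x = "Petal length (cm)", y = "Petal width (cm)")

  print(p)  # draws the plot on the current graphics device
}
```

Compared with the base plot call, the mapping from data columns to visual properties is stated once in aes, and further components are layered on without repeating it.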