In this recipe, we will learn how to construct our first heat map in R from the AirPassenger
data set, which is a standard data set included in the data
package that is available with R distributions. For this task, we will use the levelplot()
function from the lattice
package and explore the enhanced features of the gplots
package, the heatmap.2()
function.
The following image shows one of the heat maps that we are going to create in this recipe from the total count of air passengers:
Download the script 5644_01_01.r
from your account at http://www.packtpub.com and save it to your hard disk. The first section of the script, below the comment line starting with ### loading packages
, will automatically check for the availability of the R packages gplots
and lattice
, which are required for this recipe.
If those packages are not already installed, you will be prompted to select an official server from the Comprehensive R Archive Network (CRAN) to allow the automatic download and installation of the required packages.
If you have already installed those two packages prior to executing the script, I recommend you to update them to the most recent version by calling the following function in the R command line:
update.packages("gplots") update.packages("lattice")
Tip
Use the source()
function in the R command-line to execute an external script from any location on your hard drive.
If you start a new R session from the same directory as the location of the script, simply provide the name of the script as an argument in the function call as follows:
source("myScript.r")
You have to provide the absolute or relative path to the script on your hard drive if you started your R session from a different directory to the location of the script. Refer to the following example:
source("/home/username/Downloads/myScript.r")
You can view the current working directory of your current R session by executing the following command in the R command-line:
getwd()
Run the 5644OS_01_01.r
script in R to execute the following code, and take a look at the output printed on the screen as well as the PDF file, first_heatmaps.pdf
that will be created by this script:
### loading packages if (!require("gplots")) { install.packages("gplots", dependencies = TRUE) library(gplots) } if (!require("lattice")) { install.packages("lattice", dependencies = TRUE) library(lattice) } ### loading data data(AirPassengers) ### converting data rowcolNames <- list(as.character(1949:1960), month.abb) air_data <- matrix(AirPassengers, ncol = 12, byrow = TRUE, dimnames = rowcolNames) ### drawing heat maps pdf("firstHeatmaps.pdf") # 1) Air Passengers #1 print(levelplot(air_data, col.regions=heat.colors, xlab = "year", ylab = "month", main = "Air Passengers #1")) # 2) Air Passengers #2 heatmap.2(air_data, trace = "none", density.info = "none", xlab = "month", ylab = "year", main = "Air Passengers #2") # 3) Air Passengers #3 heatmap.2(air_data, trace = "none", xlab = "month", ylab = "year", main = "Air Passengers #3", density.info = "histogram", dendrogram = "column", keysize = 1.8) dev.off()
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
There are different functions for drawing heat maps in R, and each has its own advantages and disadvantages. In this recipe, we will take a look at the levelplot()
function from the lattice
package to draw our first heat map. Furthermore, we will use the advanced heatmap.2()
function from gplots
to apply a clustering algorithm to our data and add the resulting dendrograms to our heat maps.
The following image shows an overview of the different plotting functions that we are using throughout this book:
Now let us take a look at how we read in and process data from different data files and formats step-by-step:
Loading packages: The first eight lines preceding the
### loading data
section will make sure that R loads thelattice
andgplots
package, which we need for the two heat map functions in this recipe:levelplot()
andheatmap.2()
.Loading the data set: R includes a package called
data
, which contains a variety of different data sets for testing and exploration purposes. More information on the different data sets that are contained in thedata
package can be found at http://stat.ethz.ch/ROmanual/ROpatched/library/datasets/.For this recipe, we are loading the
AirPassenger
data set, which is a collection of the total count of air passengers (in thousands) for international airlines from 1949-1960 in a time-series format.data(AirPassengers)
Converting the data set into a numeric matrix: Before we can use the heat map functions, we need to convert the
AirPassenger
time-series data into a numeric matrix first. Numeric matrices in R can have characters as row and column labels, but the content itself must consist of one single mode: numerical.We use the
matrix()
function to create a numeric matrix consisting of 12 columns to which we pass theAirPassenger
time-series data row-by-row. Using the argumentdimnames = rowcolNames
, we provide row and column names that we assigned previously to the variablerowColNames
, which is a list of two vectors: a series of 12 strings representing the years 1949 to 1960, and a series of strings for the 12 three-letter abbreviations of the months from January to December, respectively.rowcolNames <- list(as.character(1949:1960), month.abb) air_data <- matrix(AirPassengers, ncol = 12, byrow = TRUE, dimnames = rowcolNames)
A si mple heat map using levelplot(): Now that we have converted the
AirPassenger
data into a numeric matrix format and assigned it to the variableair_data
, we can go ahead and construct our first heat map using thelevelplot()
function from thelattice
package:print(levelplot(air_data, col.regions=heat.colors, xlab = "year", ylab = "month", main = "Air Passengers #1"))
The
levelplot()
function creates a simple heat map with a color key to the right-hand side of the map. We can use the argumentcol.regions = heat.colors
to change the default color transition to yellow and red. X and y axis labels are specified by thexlab
andylab
parameters, respectively, and themain
parameter gives our heat map its caption.Note
In contrast to most of the other plotting functions in R, the
lattice
package returns objects, so we have to use theprint()
function in our script if we want to save the plot to a data file. In an interactive R session, theprint()
call can be omitted. Typing the name of the variable will automatically display the referring object on the screen.Creating enhanced heat maps with heatmap.2(): Next, we will use the
heatmap.2()
function to apply a clustering algorithm to theAirPassenger
data and to add row and column dendrograms to our heat map:heatmap.2(air_data, trace = "none", density.info = "none", xlab = "month", ylab = "year", main = "Air Passengers #2")
Note
Hierarchical clustering is especially popular in gene expression analyses. It is a very powerful method for grouping data to reveal interesting trends and patterns in the data matrix.
Another neat feature of
heatmap.2()
is that you can display a histogram of the count of the individual values inside the color key by including the argumentdensity.info = NULL
in the function call. Alternatively, you can setdensity.info = "density"
for displaying a density plot inside the color key.By adding the argument
keysize = 1.8
, we are slightly increasing the size of the color key—the default value ofkeysize
is1.5
:heatmap.2(air_data, trace = "none", xlab = "month", ylab = "year", main = "Air Passengers #3", density.info = "histogram", dendrogram = "column", keysize = 1.8)
Did you notice the missing row dendrogram in the resulting heat map? This is due to the argument
dendrogram = "column"
that we passed to the heat map function. Similarly, you can type row instead of column to suppress the column dendrogram, or use neither to draw no dendrogram at all.
By default, levelplot()
places the color key on the right-hand side of the heat map, but it can be easily moved to the top, bottom, or left-hand side of the map by modifying the space
parameter of colorkey
:
levelplot(air_data, col.regions = heat.colors, colorkey = list(space = "top"))
Replacing top
by left
or bottom
will place the color key on the left-hand side or on the bottom of the heat map, respectively.
Moving around the color key for heatmap.2()
can be a little bit more of a hassle. In this case we have to modify the parameters of the layout()
function. By default, heatmap.2()
passes a matrix, lmat
, to layout()
, which has the following content:
[,1] [,2] [1,] 4 3 [2,] 2 1
The numbers in the preceding matrix specify the locations of the different visual elements on the plot (1
implies heat map, 2
implies row dendrogram, 3
implies column dendrogram, and 4
implies key). If we want to change the position of the key, we have to modify and rearrange those values of lmat
that heatmap.2()
passes to layout()
.
For example, if we want to place the color key at the bottom left-hand corner of the heat map, we need to create a new matrix for lmat
as follows:
lmat [,1] [,2] [1,] 0 3 [2,] 2 1 [3,] 4 0
We can construct such a matrix by using the rbind()
function and assigning it to lmat
:
lmat = rbind(c(0,3),c(2,1),c(4,0))
Furthermore, we have to pass an argument for the column height parameter lhei
to heatmap.2()
, which will allow us to use our modified lmat
matrix for rearranging the color key:
heatmap.2(air_data, dendrogram = "none", trace = "none", density.info = "none", keysize = "1.3", xlab = "month", ylab = "year", main = "Air Passengers", lmat = rbind(c(0,3),c(2,1),c(4,0)), lhei = c(1.5,4,1.5))
If you don't need a color key for your heat map, you could turn it off by using the argument key = FALSE
for heatmap.2()
and colorkey = FALSE
for levelplot()
, respectively.
Tip
R also has a base function for creating heat maps that does not require you to install external packages and is most advantageous if you can go without a color key. The syntax is very similar to the heatmap.2()
function, and all options for heatmap.2()
that we have seen in this recipe also apply to heatmap()
:
heatmap(air_data, xlab = "month", ylab = "year", main = "Air Passengers")
By default, the dendrograms of heatmap.2()
are created by a hierarchical agglomerate clustering method, also known as bottom-up clustering.
In this approach, all individual objects start as individual clusters and are successively merged until only one single cluster remains. The distance between a pair of clusters is calculated by the farthest neighbor method, also called the complete linkage method, which is based by default on the Euclidean distance of the two points from both clusters that are farthest apart from each other. The computed dendrograms are then reordered based on the row and column means.
By modifying the default parameters of the dist()
function, we can use another distance measure rather than the Euclidean distance. For example, if we want to use the Manhattan distance measure (based on a grid-like path rather than a direct connection between two objects), we would modify the method
parameter of the dist()
function and assign it to a variable distance
first:
distance = dist(myData, method = "manhattan")
Other options for the method
parameter are: euclidean
(default), maximum
, canberra
, binary
, or minkowski
.
To use other agglomeration methods than the complete linkage method, we modify the method
parameter in the hclust()
function and assign it to another variable cluster
. Note the first argument distance
that we pass to the hclust()
function, which comes from our previous assignment:
cluster = hclust(distance, method = "ward")
By setting the method
parameter to ward
, R will use Joe H. Ward's minimum variance method for hierarchical clustering. Other options for the method
parameter that we can pass as arguments to hclust()
are: complete
(default), single
, average
, mcquitty
, median
, or centroid
.
To use our modified clustering parameters, we simply call the as.dendrogram()
function within heatmap.2()
using the variable cluster
that we assigned previously:
heatmap.2(myData, Colv = as.dendrogram(cluster), Rowv = as.dendrogram(cluster))
We can also draw the cluster dendrogram without the heat map by using the plot()
function:
plot(cluster)