Book Image

Instant Heat Maps in R How-to

By : Sebastian Raschka
Book Image

Instant Heat Maps in R How-to

By: Sebastian Raschka

Overview of this book

R has grown rapidly over the years to become one of the most versatile and valuable tools for data analysis and graphing. One of its many useful features is the heat map representation of numerical data, which is an invaluable tool to discover patterns in data quickly and efficiently. Instant Heat Maps in R How-to provides you with practical recipes to create heat maps of all difficulty levels by yourself right from the start. At the end of each recipe, you will find an in-depth analysis that will equip you with everything you need to know to frame the code to your own needs. Instant Heat Maps in R will present you with all the different heat map plotting functions that exist in R. You will start by creating simple heat maps before moving on to learn how to add more features to them. While you advance step-by-step through the well-connected recipes, you will find out which tool suits the given situation best. You will learn how to read data from popular file formats and how to format the data to create heat maps as well as the ways to export them for presentation.
Table of Contents (7 chapters)

Creating your first heat map in R (Simple)


In this recipe, we will learn how to construct our first heat map in R from the AirPassenger data set, which is a standard data set included in the data package that is available with R distributions. For this task, we will use the levelplot() function from the lattice package and explore the enhanced features of the gplots package, the heatmap.2() function.

The following image shows one of the heat maps that we are going to create in this recipe from the total count of air passengers:

Getting ready

Download the script 5644_01_01.r from your account at http://www.packtpub.com and save it to your hard disk. The first section of the script, below the comment line starting with ### loading packages, will automatically check for the availability of the R packages gplots and lattice, which are required for this recipe.

If those packages are not already installed, you will be prompted to select an official server from the Comprehensive R Archive Network (CRAN) to allow the automatic download and installation of the required packages.

If you have already installed those two packages prior to executing the script, I recommend you to update them to the most recent version by calling the following function in the R command line:

update.packages("gplots")
update.packages("lattice")

Tip

Use the source() function in the R command-line to execute an external script from any location on your hard drive.

If you start a new R session from the same directory as the location of the script, simply provide the name of the script as an argument in the function call as follows:

source("myScript.r")

You have to provide the absolute or relative path to the script on your hard drive if you started your R session from a different directory to the location of the script. Refer to the following example:

source("/home/username/Downloads/myScript.r")

You can view the current working directory of your current R session by executing the following command in the R command-line:

getwd()

How to do it...

Run the 5644OS_01_01.r script in R to execute the following code, and take a look at the output printed on the screen as well as the PDF file, first_heatmaps.pdf that will be created by this script:

### loading packages
if (!require("gplots")) {
install.packages("gplots", dependencies = TRUE)
library(gplots)
}
if (!require("lattice")) {
install.packages("lattice", dependencies = TRUE)
library(lattice)
}

### loading data
data(AirPassengers)

### converting data
rowcolNames <- list(as.character(1949:1960), month.abb)
air_data <- matrix(AirPassengers, 
  ncol = 12, 
  byrow = TRUE, 
  dimnames = rowcolNames)

### drawing heat maps
pdf("firstHeatmaps.pdf")

# 1) Air Passengers #1
print(levelplot(air_data,
  col.regions=heat.colors, 
  xlab = "year",
  ylab = "month", 
  main = "Air Passengers #1"))

# 2) Air Passengers #2
heatmap.2(air_data, 
  trace = "none", 
  density.info = "none", 
  xlab = "month", 
  ylab = "year", 
  main = "Air Passengers #2")

# 3) Air Passengers #3
heatmap.2(air_data, 
  trace = "none", 
  xlab = "month", 
  ylab = "year", 
  main = "Air Passengers #3",
  density.info = "histogram",
  dendrogram = "column",
  keysize = 1.8)

dev.off()

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

How it works...

There are different functions for drawing heat maps in R, and each has its own advantages and disadvantages. In this recipe, we will take a look at the levelplot() function from the lattice package to draw our first heat map. Furthermore, we will use the advanced heatmap.2() function from gplots to apply a clustering algorithm to our data and add the resulting dendrograms to our heat maps.

The following image shows an overview of the different plotting functions that we are using throughout this book:

Now let us take a look at how we read in and process data from different data files and formats step-by-step:

  1. Loading packages: The first eight lines preceding the ### loading data section will make sure that R loads the lattice and gplots package, which we need for the two heat map functions in this recipe: levelplot() and heatmap.2().

    Note

    Each time we start a new session in R, we have to load the required packages in order to use the levelplot() and heatmap.2() functions. To do so, enter the following function calls directly into the R command-line or include them at the beginning of your script:

    • library(lattice)

    • library(gplots)

  2. Loading the data set: R includes a package called data, which contains a variety of different data sets for testing and exploration purposes. More information on the different data sets that are contained in the data package can be found at http://stat.ethz.ch/ROmanual/ROpatched/library/datasets/.

    For this recipe, we are loading the AirPassenger data set, which is a collection of the total count of air passengers (in thousands) for international airlines from 1949-1960 in a time-series format.

    data(AirPassengers)
  3. Converting the data set into a numeric matrix: Before we can use the heat map functions, we need to convert the AirPassenger time-series data into a numeric matrix first. Numeric matrices in R can have characters as row and column labels, but the content itself must consist of one single mode: numerical.

    We use the matrix() function to create a numeric matrix consisting of 12 columns to which we pass the AirPassenger time-series data row-by-row. Using the argument dimnames = rowcolNames, we provide row and column names that we assigned previously to the variable rowColNames, which is a list of two vectors: a series of 12 strings representing the years 1949 to 1960, and a series of strings for the 12 three-letter abbreviations of the months from January to December, respectively.

    rowcolNames <- list(as.character(1949:1960), month.abb)
    air_data <- matrix(AirPassengers, 
      ncol = 12, 
      byrow = TRUE, 
      dimnames = rowcolNames)
  4. A si mple heat map using levelplot(): Now that we have converted the AirPassenger data into a numeric matrix format and assigned it to the variable air_data, we can go ahead and construct our first heat map using the levelplot() function from the lattice package:

    print(levelplot(air_data,
      col.regions=heat.colors, 
      xlab = "year",
      ylab = "month", 
      main = "Air Passengers #1"))

    The levelplot() function creates a simple heat map with a color key to the right-hand side of the map. We can use the argument col.regions = heat.colors to change the default color transition to yellow and red. X and y axis labels are specified by the xlab and ylab parameters, respectively, and the main parameter gives our heat map its caption.

    Note

    In contrast to most of the other plotting functions in R, the lattice package returns objects, so we have to use the print() function in our script if we want to save the plot to a data file. In an interactive R session, the print() call can be omitted. Typing the name of the variable will automatically display the referring object on the screen.

  5. Creating enhanced heat maps with heatmap.2(): Next, we will use the heatmap.2() function to apply a clustering algorithm to the AirPassenger data and to add row and column dendrograms to our heat map:

    heatmap.2(air_data, 
        trace = "none", 
        density.info = "none", 
      xlab = "month", 
      ylab = "year", 
      main = "Air Passengers #2")

    Note

    Hierarchical clustering is especially popular in gene expression analyses. It is a very powerful method for grouping data to reveal interesting trends and patterns in the data matrix.

    Another neat feature of heatmap.2() is that you can display a histogram of the count of the individual values inside the color key by including the argument density.info = NULL in the function call. Alternatively, you can set density.info = "density" for displaying a density plot inside the color key.

    By adding the argument keysize = 1.8, we are slightly increasing the size of the color key—the default value of keysize is 1.5:

    heatmap.2(air_data, 
      trace = "none", 
      xlab = "month", 
      ylab = "year", 
      main = "Air Passengers #3",
    density.info = "histogram",  
    dendrogram = "column",
      keysize = 1.8) 

    Did you notice the missing row dendrogram in the resulting heat map? This is due to the argument dendrogram = "column" that we passed to the heat map function. Similarly, you can type row instead of column to suppress the column dendrogram, or use neither to draw no dendrogram at all.

There's more...

By default, levelplot() places the color key on the right-hand side of the heat map, but it can be easily moved to the top, bottom, or left-hand side of the map by modifying the space parameter of colorkey:

levelplot(air_data, 
col.regions = heat.colors, 
colorkey = list(space = "top"))

Replacing top by left or bottom will place the color key on the left-hand side or on the bottom of the heat map, respectively.

Moving around the color key for heatmap.2() can be a little bit more of a hassle. In this case we have to modify the parameters of the layout() function. By default, heatmap.2() passes a matrix, lmat, to layout(), which has the following content:

     [,1] [,2]
[1,]    4    3
[2,]    2    1

The numbers in the preceding matrix specify the locations of the different visual elements on the plot (1 implies heat map, 2 implies row dendrogram, 3 implies column dendrogram, and 4 implies key). If we want to change the position of the key, we have to modify and rearrange those values of lmat that heatmap.2() passes to layout().

For example, if we want to place the color key at the bottom left-hand corner of the heat map, we need to create a new matrix for lmat as follows:

lmat
     [,1] [,2]
[1,]    0    3
[2,]    2    1
[3,]    4    0

We can construct such a matrix by using the rbind() function and assigning it to lmat:

lmat = rbind(c(0,3),c(2,1),c(4,0))

Furthermore, we have to pass an argument for the column height parameter lhei to heatmap.2(), which will allow us to use our modified lmat matrix for rearranging the color key:

heatmap.2(air_data, 
dendrogram = "none",
trace = "none",
density.info = "none", 
keysize = "1.3",
xlab = "month", 
ylab = "year", 
main = "Air Passengers", 
lmat = rbind(c(0,3),c(2,1),c(4,0)), 
lhei = c(1.5,4,1.5))

If you don't need a color key for your heat map, you could turn it off by using the argument key = FALSE for heatmap.2() and colorkey = FALSE for levelplot(), respectively.

Tip

R also has a base function for creating heat maps that does not require you to install external packages and is most advantageous if you can go without a color key. The syntax is very similar to the heatmap.2() function, and all options for heatmap.2() that we have seen in this recipe also apply to heatmap():

heatmap(air_data, 
    xlab = "month", 
   ylab = "year", 
   main = "Air Passengers")

More information on dendrograms and clustering

By default, the dendrograms of heatmap.2() are created by a hierarchical agglomerate clustering method, also known as bottom-up clustering.

In this approach, all individual objects start as individual clusters and are successively merged until only one single cluster remains. The distance between a pair of clusters is calculated by the farthest neighbor method, also called the complete linkage method, which is based by default on the Euclidean distance of the two points from both clusters that are farthest apart from each other. The computed dendrograms are then reordered based on the row and column means.

By modifying the default parameters of the dist() function, we can use another distance measure rather than the Euclidean distance. For example, if we want to use the Manhattan distance measure (based on a grid-like path rather than a direct connection between two objects), we would modify the method parameter of the dist() function and assign it to a variable distance first:

distance = dist(myData, method = "manhattan")

Other options for the method parameter are: euclidean (default), maximum, canberra, binary, or minkowski.

To use other agglomeration methods than the complete linkage method, we modify the method parameter in the hclust() function and assign it to another variable cluster. Note the first argument distance that we pass to the hclust() function, which comes from our previous assignment:

cluster = hclust(distance, method = "ward")

By setting the method parameter to ward, R will use Joe H. Ward's minimum variance method for hierarchical clustering. Other options for the method parameter that we can pass as arguments to hclust() are: complete (default), single, average, mcquitty, median, or centroid.

To use our modified clustering parameters, we simply call the as.dendrogram() function within heatmap.2() using the variable cluster that we assigned previously:

heatmap.2(myData, 
Colv = as.dendrogram(cluster), 
Rowv = as.dendrogram(cluster))

We can also draw the cluster dendrogram without the heat map by using the plot() function:

plot(cluster)

Tip

To turn off row and column reordering, we need to turn off the dendrograms and set the parameters Colv and Rowv to NA:

heatmap.2(my_data, dendrograms = "none", Colv = "NA", Rowv = "NA")