A site to help Biochemists learn R.

Starting points

Monday, 7 March 2016

Gene Expression Analysis and Visualization for VizBi 2016 (Pt 1)

UPDATE: During the VizBi2016 tutorial session, the participants noticed a couple of errors in the script. I think these have now been corrected.
Also, Christian Hauer suggested that we visualise the distance matrix using the heatmap packages. He suggested pheatmap, in particular. Thanks Christian.

I have spent some time creating two quite long scripts that generate a selection of visualisation of some gene expression data generated recently by Mel Boyd at the Cardiff CLL Research Group. The script illustrates a workflow that results in a selection of interesting and useful visualisations that allow us to compare the different drug treatments used.

The data was generated from an Affymetrix HTA 2.0 chip. The data has been normalised and mapped using the oligo package. I have focussed on the protein coding transcripts and I have generated a random selection of 5000 of these genes.

This part of the experiment uses the data from four different experimental 'drugs' which target different pathways and have different cellular effects. This analysis and the associated visualisations are used to allow us to compare the drug treatments and look at the data as a whole set.

I am happiest with the visualisation that is generated by the loadings of the principal component analysis but there are some other good visualisations generated by the scripts too.
The cluster at the top right is from the drug that causes the most changes. The other drugs cause few changes and so map closer to the untreated samples.

Next script shows volcano plots and a heatmap.

Here is the script:
# download the data from github
x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/microArrayData.tsv")
data <- read.table(text = x, header = TRUE, sep = "\t")

# basic commands for looking at the data
head(data)      # view first six rows
str(data)       # shows it's a dataframe
colnames(data)  # column names
View(data)      # this works in R-Studio

# for first part of analysis
# subset of the data with just the numbers.
new.data <- data[2:16]
# these are log2 values 
# The data has been normalized

## FIRST visualisation of the data: a simple box plot
        las =2, # las = 2 turns the text around to show sample names
        ylab = expression(bold("expression")),
        main="Boxplot of expression data")  

## SECOND visualisation(s): calculate and visualise distances
# turn into a matrix 
data.m <- as.matrix(new.data)
# transpose the data because a distance matrix works in rows
data.m.t <- t(data.m)
# calculate the distances and put the calculations into an object called distances
distances <- dist(data.m.t)
# convert this distances object into a matrix. 
distances.m <- data.matrix(distances)
# you can look at this object.
# we can extract the size of the object and the titles
dim <- ncol(distances.m)
names <- row.names(distances.m)
# now to create the visualisation of the difference matrix. 
# first the coloured boxes
image(1:dim, 1:dim, distances.m, axes = FALSE, xlab = "", ylab = "")
# now label the axis
axis(3, 1:dim, names, cex.axis = 0.8, las=3)
axis(2, 1:dim, names, cex.axis = 0.8, las=1)
# add the values of the differences
text(expand.grid(1:dim, 1:dim), sprintf("%0.1f", distances.m), cex=1)

# heatmaps ccan also be used to visualise the distance matrix. 
# samples can be clustered or you can stop this. Thanks Christian Hauer



## THIRD visualisation: use the distance matrix to do some clustering

# three clusters - based on the distance matrix and shows the same thing really.

## FOURTH visualisation: principal component analysis
# Do a Principal Component Analysis  
# and widely-used technique for viewing patterns of gross variation
# in datasets
# We can see whether an array is substantially different to the others
pca <- princomp(data.m)  # function that does the PCA
summary(pca)  # one component accounts for 99% of the variance
plot(pca, type = "l")  

names <- factor(gsub("\\.\\d", "", names)) # change names into factor
plot(pca$loadings, col = names, pch = 19)
text(pca$loadings, cex = 0.7, label = colnames(new.data), pos =3)

# some of the treatment cluster well together (e.g. Drug D) others not so much

# can be interesting to look at a subset of the data...
pca <- princomp(data.m[,1:12])
plot(pca$loadings, col = names, pch = 19, 
     main = "Just three drug treatments")
text(pca$loadings, cex = 0.7, label = colnames(new.data)[1:12], pos =3)

# basically happy with our data generally. 
# next step in next script is to look at differentially expressed transcripts.


  1. Hi Paul,
    Thanks for sharing your knowledge. I am wondering if I could find the script that has been used for the generation of these data ( per-processing, Normalization, deferentially expressed gene, etc. ). I am currently running some samples on Affymetrix HTA 2.0 chip and would like to know how to process them. I can figure out how to correctly set design and contrast matrices, since I don't have Treatment Vs Untreated group, I want all the samples beings compare against each other.

    Thanks in advance.

  2. Hi Hossein,
    Thanks for your message.
    The packages and workflow are show in the script on this page:
    Please give it a go. The downloading can be a little slow.
    If it works, do say.
    If not, please tell me and I will try to help further.
    Best wishes,


Comments and suggestions are welcome.