Monday, 7 March 2016

Gene Expression Analysis and Visualization for VizBi 2016 (Pt 1)

UPDATE: During the VizBi2016 tutorial session, the participants noticed a couple of errors in the script. I think these have now been corrected.
Also, Christian Hauer suggested visualising the distance matrix with heatmap packages, pheatmap in particular. Thanks Christian.

I have spent some time creating two quite long scripts that generate a selection of visualisations of some gene expression data generated recently by Mel Boyd at the Cardiff CLL Research Group. The scripts illustrate a workflow that produces a set of interesting and useful visualisations for comparing the different drug treatments.

The data were generated on an Affymetrix HTA 2.0 chip, then normalised and mapped using the oligo package. I have focussed on the protein-coding transcripts and taken a random selection of 5000 of these genes.
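For context, here is a minimal sketch of the kind of oligo workflow that could produce data like this. It is a sketch only, not the exact pipeline used for this dataset, and it assumes the CEL files sit in the working directory:

# a minimal sketch, assuming CEL files in the working directory
library(oligo)
rawData <- read.celfiles(list.celfiles())   # read the HTA 2.0 CEL files
eset <- rma(rawData)                        # RMA normalisation gives log2 values
head(exprs(eset))                           # the normalised expression matrix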

This part of the experiment uses the data from four different experimental 'drugs' which target different pathways and have different cellular effects. The analysis and the associated visualisations allow us to compare the drug treatments and look at the data as a whole.

I am happiest with the visualisation generated from the loadings of the principal component analysis, but the scripts produce some other good visualisations too.
In the loadings plot, the cluster at the top right comes from the drug that causes the most changes; the other drugs cause few changes and so map closer to the untreated samples.

The next script shows volcano plots and a heatmap.


Here is the script:
# START 
# download the data from github
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/microArrayData.tsv")
data <- read.table(text = x, header = TRUE, sep = "\t")

# basic commands for looking at the data
head(data)      # view first six rows
str(data)       # shows it's a dataframe
colnames(data)  # column names
View(data)      # this works in R-Studio

# for first part of analysis
# subset of the data with just the numbers.
new.data <- data[2:16]
# these are log2 values 
# The data has been normalized
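# optional sanity check (a sketch): log2-normalised values should sit in a modest range
range(new.data)
summary(as.vector(as.matrix(new.data)))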


## FIRST visualisation of the data: a simple box plot
boxplot(new.data, 
        las =2, # las = 2 turns the text around to show sample names
        ylab = expression(bold("expression")),
        main="Boxplot of expression data")  




## SECOND visualisation(s): calculate and visualise distances
# turn into a matrix 
data.m <- as.matrix(new.data)
# transpose the data because dist() calculates distances between rows
# and we want sample-to-sample distances
data.m.t <- t(data.m)
# calculate the distances and put the calculations into an object called distances
distances <- dist(data.m.t)
# convert this distances object into a matrix. 
distances.m <- data.matrix(distances)
# you can look at this object.
View(distances.m)
# we can extract the size of the object and the titles
dim <- ncol(distances.m)
names <- row.names(distances.m)
# now to create the visualisation of the distance matrix. 
# first the coloured boxes
image(1:dim, 1:dim, distances.m, axes = FALSE, xlab = "", ylab = "")
# now label the axis
axis(3, 1:dim, names, cex.axis = 0.8, las=3)
axis(2, 1:dim, names, cex.axis = 0.8, las=1)
# add the values of the distances
text(expand.grid(1:dim, 1:dim), sprintf("%0.1f", distances.m), cex=1)





# heatmaps can also be used to visualise the distance matrix. 
# samples are clustered by default, but this can be turned off. Thanks Christian Hauer
heatmap(distances.m)


# heatmap3() needs the heatmap3 package: install.packages("heatmap3")
library(heatmap3)
heatmap3(distances.m)



# pheatmap() needs the pheatmap package: install.packages("pheatmap")
library(pheatmap)
pheatmap(distances.m)



## THIRD visualisation: use the distance matrix to do some clustering
plot(hclust(distances))


# three clusters emerge - based on the same distance matrix, so this shows
# the same groupings as the distance visualisations above.
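# to pull out the three clusters programmatically, cutree() can be used (a sketch)
clusters <- cutree(hclust(distances), k = 3)
clusters   # a named vector mapping each sample to a cluster number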

## FOURTH visualisation: principal component analysis
# Do a Principal Component Analysis,
# a widely-used technique for viewing patterns of gross variation
# in datasets.
# We can see whether an array is substantially different to the others
pca <- princomp(data.m)  # function that does the PCA
summary(pca)  # one component accounts for 99% of the variance
plot(pca, type = "l")  
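# the proportion of variance can also be calculated directly from the sdev (a sketch)
round(pca$sdev^2 / sum(pca$sdev^2), 3)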




names <- factor(gsub("\\.\\d", "", names)) # change names into factor
plot(pca$loadings, col = names, pch = 19)
text(pca$loadings, cex = 0.7, label = colnames(new.data), pos =3)



# some of the treatments cluster well together (e.g. Drug D); others not so much


# can be interesting to look at a subset of the data...
pca <- princomp(data.m[,1:12])
plot(pca$loadings, col = names, pch = 19, 
     main = "Just three drug treatments")
text(pca$loadings, cex = 0.7, label = colnames(new.data)[1:12], pos =3)

# basically happy with our data overall. 
# the next step, in the next script, is to look at differentially expressed transcripts.

Illustrating some R Fundamentals for VizBi 2016

I have surveyed the participants for my R tutorial at VizBi2016. The range of experience is a bit of a challenge: some participants have used R once or twice, while others have plenty of experience. It's going to be a bit difficult to make the morning useful for everybody but I will try....

For the beginners, I'm going to start with this script which is designed to illustrate three of the fundamentals of R:

  • functions
  • objects
  • packages

It's an expansion of the script to draw a graph of a protein assay with ggplot:


Here is the script:
# First script for VizBi2016
# written to illustrate the Fundamentals of R

## first 'Fundamental' is ** functions **
# functions do things!
# you know it's a function because it contains brackets

# c() is a function - sometimes called combine

## second 'Fundamental' is ** objects **
# objects contain data
# we make them with functions

# example:
# using the c() function to create the object prot
# Protein Concentrations
prot <- c(0.000, 0.016, 0.031, 0.063, 0.125, 0.250, 0.500, 1.000, 
          0.000, 0.016, 0.031, 0.063, 0.125, 0.250, 0.500, 1.000) 

# a function has arguments - always inside the brackets

# Absorbance from my protein assay
abs <- c(0.329, 0.352, 0.349, 0.379, 0.417, 0.491, 0.668, 0.956, 
         0.327, 0.341, 0.355, 0.383, 0.417, 0.446, 0.655, 0.905)

# these objects are called 'vectors' - key term

## now we are going to play with some of these objects

#Calculate the line using the linear model function lm()
line <- lm(abs~prot)

# creates another kind of object - a list
# multiple parts with a different type of data in each part

# to look at the object, type its name:
line
summary(line)

# access particular parts of the object line
# using the $ dollar sign
# Equation of a line y = mx + c
# In our case abs = slope * prot + intercept
# ukn.prot = (abs - intercept)/slope
int <- summary(line)$coefficients[1]
slope <- summary(line)$coefficients[2]

# now calculate some unknown protein concs from absorbances
# put the unknowns into a vector
abs.ukns <- c(0.554, 0.568, 0.705)

# rearrange the equation of the line to ukn.prot = (abs - intercept)/slope
prot.ukns <- (abs.ukns - int)/slope

# CONCEPT ALERT - functions work on a whole vector of numbers at once
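# a tiny illustration of this: one call, many answers
sqrt(c(1, 4, 9))    # returns 1 2 3
c(1, 2, 3) * 10     # arithmetic is vectorised too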


## graphing

# quick graph using base R
plot(prot, abs)
abline(line)

## BUT there is a better way!!

## third 'Fundamental' is ** packages **
# these are collections of functions that have been written by others
# these can be installed 
# install.packages("ggplot2")
# and then activated
library(ggplot2)

# Convert from one type of object to another
# another kind of object is a data.frame
# a bit like an Excel spreadsheet
# often when we import data we get a data frame
data <- as.data.frame(prot)
data$abs <- abs


# make a simple graph with ggplot
# step one - add the data and then the 'aesthetics' 
# key aesthetics in this case: x and y
p <- ggplot(data=data,          # specify the data frame with data
            aes(x=prot, y=abs)) # specify x and y for the graph

# creates a list and a blank plot

# add a type of graph 
p <- p + geom_point()

# show the graph
p

# add a line
p + stat_smooth(method = "lm")

# more detailed plot: 
p <- ggplot(data=data,          # specify the data frame with data
            aes(x=prot, y=abs)) +   # specify x and y for the graph
  geom_point() +          # make a scatter plot
  stat_smooth(method = "lm") +  # add a linear model line
  xlab("[Protein] (microg/ml)") +   # label x-axis
  ylab("Absorbance (570nm)") +    # label y-axis
  ggtitle("Protein Assay") +  # add a title
  theme_bw() +      # a simple theme
  expand_limits(y=c(0,1)) +    # customise the y-axis
  annotate(geom="text", x=0.85, y= 0.6, label="Abs         Prot",  color="red")

# put the answers on the graph
for (i in 1:length(abs.ukns)){
  p <- p + annotate(geom="text", x = 0.8, y = (0.6 - i/20), label=abs.ukns[i])
  p <- p + annotate(geom="text", x = 0.92, y = (0.6 - i/20), label=round(prot.ukns[i], 3))
}

p # show us the graph...

## To Learn from this script:
# run this script line by line yourself and see what happens.
# watch what happens

# make the plot object again and try some things...
p <- ggplot(data=data,          # specify the data frame with data
            aes(x=prot, y=abs)) # specify x and y for the graph

# try to change the colour of the points
p + geom_point(colour = "blue")

# try to change the size of the points
p + geom_point(size = 5, colour = "red")

# look at the documentation for geom_point
# http://docs.ggplot2.org/current/geom_point.html
# try some of the functions and see if you can make sense of them






Monday, 8 February 2016

Visualising some CLL proteomic data for VizBi2016... Part 1...

I am preparing scripts for VizBi2016. I've decided to use a chronic lymphocytic leukaemia (CLL) proteomic data set from a research group in Liverpool. It's the largest CLL proteomic dataset and it's published.

The key steps are:

  1. Get the data into R using the read_excel() function from the readxl package
  2. Draw the first visualisation - a simple box plot
  3. Calculate the distances between the samples
  4. Draw a visualisation of the distance matrix
  5. Cluster the samples and then draw a visualisation of the sample cluster
  6. Finally, using the statistics from the paper, draw a volcano plot.

Here is the script to do that....

START OF SCRIPT
# writing my first script for VizBi
# download and visualising the MCP CLL proteomics data

library(ggplot2)
library(readxl)


# I've decided that I would like to make CLL proteomics the first data set for the VizBi tutorials
# I've done some myself but the largest data set currently publically available is from the Liverpool group
# Reference: http://www.mcponline.org/content/14/4/933.full
# http://www.mcponline.org/content/suppl/2015/02/02/M114.044479.DC1/mcp.M114.044479-2.xls
link <- "http://www.mcponline.org/content/suppl/2015/02/02/M114.044479.DC1/mcp.M114.044479-2.xls"

# the download.file() function downloads and saves the file with the name given
download.file(url=link,destfile="file.xls", mode="wb")

# then we can open the file and extract the data using the read_excel() function. 
data <- read_excel("file.xls")

# so the data is in R now, in an object called "data". 
# The object "data" consists of 3521 observations of 24 variables. 

# we can look at the data 
View(data)
# it's made up of protein names, accession numbers, 18 samples and some statistics

# we can check the structure 
str(data)
# this tells us about the types of data that make up each column. 

# the first thing we are going to do is make and visualise a distance matrix. 

# columns 3 to 20 are the data.  
# we can check this:
head(data[3:20])  # shows the first six rows of data
colnames(data[3:20]) # shows the column names. 
# this seems correct. 

# plot the data - always an important first step!
boxplot(data[3:20], 
        las =2, # las = 2 turns the text around to show sample names
        ylab = expression(bold("expression")),
        main="Boxplot of expression data")  


# we can only calculate distances in a matrix where all the values are the same mode - e.g. numbers
# convert data frame (data) into a matrix 
# only want a subset of the data - the data from the samples. 
data.m <- as.matrix(data[3:20])
# transpose the data because dist() calculates distances between rows
# and we want sample-to-sample distances
data.m.t <- t(data.m)

# calculate the distances and put the calculations into an object called distances
distances <- dist(data.m.t)

# convert this distances object into a matrix. 
distances.m <- data.matrix(distances)

# you can look at this object.
View(distances.m)

# we can extract the size of the object and the titles
dim <- ncol(distances.m)
names <- row.names(distances.m)

# now to create the visualisation of the distance matrix. 
# first the coloured boxes
image(1:dim, 1:dim, distances.m, axes = FALSE, xlab = "", ylab = "")

# now label the axis
axis(3, 1:dim, names, cex.axis = 0.8, las=3)
axis(2, 1:dim, names, cex.axis = 0.8, las=1)

# add the values of the distances
text(expand.grid(1:dim, 1:dim), sprintf("%0.1f", distances.m), cex=1)

# looking at this, there is lots of variation between the samples....

# exporting this image as a tiff file with a width of 1000 pixels seems to work well; 
# some of the other formats don't work as well. 
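# a sketch of that export (the file name is hypothetical):
tiff("distance_matrix.tif", width = 1000, height = 1000)
image(1:dim, 1:dim, distances.m, axes = FALSE, xlab = "", ylab = "")
axis(3, 1:dim, names, cex.axis = 0.8, las=3)
axis(2, 1:dim, names, cex.axis = 0.8, las=1)
text(expand.grid(1:dim, 1:dim), sprintf("%0.1f", distances.m), cex=1)
dev.off()   # close the tiff device to write the file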

# to make the cluster dendrogram object using the hclust() function
hc <- hclust(distances) 
# plot the cluster diagram
# some interesting groups in the data
plot(hc, 
     xlab =expression(bold("All Samples")), 
     ylab = expression(bold("Distance")))
# interestingly the mutated and unmutated samples don't group together
# there are smaller clusters within the data
# this indicates that there is variation in the data set that is not explained by that grouping. 

# we can draw a volcano plot because they have calculated the log2 fold change and the p-value
colnames(data) <- make.names(colnames(data))  # gets rid of the spaces in the column names.
data$threshold = as.factor(data$P.Value < 0.05)

##Construct the plot object
g <- ggplot(data=data, 
            aes(x=Log2.Fold.Change, y =-log10(P.Value), 
                colour=threshold)) +
  geom_point(alpha=0.4, size=1.75) +
  xlim(c(-6, 6)) +
  xlab("log2 fold change") + ylab("-log10 p-value") +
  theme_bw() +
  theme(legend.position="none")

g  # shows us the object - the graph

END OF SCRIPT

Tuesday, 19 January 2016

ggplot2 2.0.0 geom_smooth() requires different arguments...

A new version of ggplot2 (2.0.0) was released on December 21st 2015. The new version changes how arguments are passed to the geom_smooth() function, which I have used to generate non-linear curves for enzyme kinetics and an LD50 plot.

The old code (pre ggplot2 2.0.0) passes the formula and start values directly as arguments to geom_smooth(). I've changed one of the scripts - the one for the LD50 plot. I'll change the enzymology scripts over the next few days.

Old code:
# Add the line to graph
p <- p +  geom_smooth(method = "nls", 
              formula = y ~ bot+(top-bot)/(1+( x / LD50)^slope), 
              start=list(bot=20, top=95, LD50=3, slope=-12),
              se = F, size = 0.5)


The new code makes it more obvious which arguments belong to the method used for smoothing. I can see the value, but it has made extra work for me in the short term.

New ggplot 2.0.0 code:

# Add the line to graph using methods.args (New: Jan 2016)
p <- p +  geom_smooth(method = "nls",
                      method.args = list(formula = y ~ bot+(top-bot)/(1+( x / LD50)^slope), 
                                         start=list(bot=20, top=95, LD50=3, slope=-12)),
                      se = FALSE)
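
For anyone who wants to test the change in isolation, here is a minimal, self-contained sketch with simulated dose-response data (the data and start values are invented for illustration). This variant keeps formula as a direct argument to geom_smooth() and moves only the start list into method.args, which is another accepted way to write it:

library(ggplot2)

# simulate data from the same four-parameter equation, plus a little noise
set.seed(1)
dose <- seq(1, 6, by = 0.25)
resp <- 20 + (95 - 20) / (1 + (dose / 3)^-12) + rnorm(length(dose), sd = 2)
df <- data.frame(dose, resp)

ggplot(df, aes(x = dose, y = resp)) +
  geom_point() +
  geom_smooth(method = "nls",
              formula = y ~ bot + (top - bot) / (1 + (x / LD50)^slope),
              method.args = list(start = list(bot = 20, top = 95, LD50 = 3, slope = -12)),
              se = FALSE)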


Thanks to 오세진 and Ali for commenting on my scripts and bringing this to my attention.

Thursday, 7 January 2016

Drawing a cytokine receptor with R - TNFR1...

During my research, I have studied the pro-inflammatory cytokine TNF and its receptor, TNFR1. TNF and TNFR1 are involved in inflammatory diseases such as rheumatoid arthritis. As part of developing my skills drawing molecules with R, I have written this script to draw TNFR1.

I like to draw receptors vertically, with the thought that the cell plasma membrane lies horizontal to the view. The heights of the rectangles are proportional to the number of amino acids in each region.

I took the data from the UniProt page for TNFR1. Here is the diagram.



The key functions involved are the 
  • plot() function that creates the space to draw in
  • rect() function that draws a rectangle
  • text() function that adds text at a designated spot (a minimal sketch combining these follows below). 
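
A minimal sketch of how these three functions work together (the numbers are invented for illustration):

# a tiny sketch: plot() makes the canvas, rect() draws, text() labels
plot(c(0, 10), c(0, 10), type = "n",
     xlab = "", xaxt = 'n', ylab = "", yaxt = 'n')   # empty 10 x 10 canvas
rect(xleft = 4, ybottom = 2, xright = 6, ytop = 8, col = "blue")
text(5, 9, "a label above the rectangle")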


Here is the script:
# START
# draw the Tumor Necrosis Factor Receptor 1 with R.
# draw it as a series of rectangles
# vertically rather than horizontally. 

# using the Uniprot webpage for the information
# http://www.uniprot.org/uniprot/P19438
# cut and paste from the XML page to create two objects
# first object is a list containing accession number and names


## Step 1: a list containing names 
# list of names...
names <- list(
  accession = c("P19438"),
  name = "TNR1A_HUMAN",
  protein.recommendedName.fullName = "Tumor necrosis factor receptor superfamily member 1A",
  protein.recommendedName.alternativeName = "Tumor necrosis factor receptor 1",
  protein.recommendedName.alternativeName.shortName = "TNF-RI",
  gene.name.primary = "TNFRSF1A",
  gene.name.synonym = "TNFAR",
  organism.name.scientific = "Homo sapiens"
)


## Step 2: getting the details of the molecule
# draw TNFR....
# Topological domain  22  211 Extracellular 189
# Transmembrane 212 234 Helical 22
# Topological domain 235 455 Cytoplasmic 220

# create the data frame with all the information. 
# cut and paste from the XML page to make vectors
# features to plot 
types <- c("chain", "topological domain", "transmembrane region", "topological domain") 
description <- c("Tumor necrosis factor receptor superfamily member 1A, membrane form", "Extracellular", "Helical","Cytoplasmic")
begin <- c(22,22, 212, 235)
end <- c(455, 211, 234, 455)
subbegin <- 500 - begin # do these subtractions because we want to draw from top to bottom. 
subend <- 500 - end
col <- c("white", "blue", "black", "red")  # my decision - easy to change. 

# assemble vectors into a data frame
features <- data.frame(types, description, subbegin, subend, col)

# check the structure of the data frame
str(features)
# shows description and col to be factors - this will cause problems later...
# so change them now
features$description <- as.character(features$description)
features$col <- as.character(features$col)  


## step 3: draw the diagram - but vertically
screen.width <- 25 
screen.height <- 550  # this is a bit arbitrary

# we create a plot space to draw in....
plot(c(0, screen.width), 
     c(0, screen.height), 
     type= "n", 
     xlab = "", xaxt = 'n',   # suppress the x label and x axis
     ylab = "", yaxt = 'n')   # suppress the y label and y axis

# make the rectangles in a loop
for (i in 1:length(features$types) ) {
  rect(xleft   = screen.width/2,
       ytop    = features$subbegin[i],
       ybottom = features$subend[i],
       xright  = screen.width/2+2.5,
       col = features$col[i])
}
## Step 4: add text 
# text positioning is hard-coded for this vertical layout
# add text to the top of the illustration with the recommended name
text(screen.width/2, screen.height-2.5, names$protein.recommendedName.fullName, cex=1)
# and the alternative name
text(screen.width/2, screen.height-30, names$protein.recommendedName.alternativeName, cex=0.8)
# and source
text(12, 3, "Source of data: http://www.uniprot.org/uniprot/P19438", cex=0.8)

# add the descriptions of the features and source
# "chain" doesn't really help as a piece of text so going to leave this out. 
pos.text.y <- features$subbegin[2:4] + (features$subend[2:4] - features$subbegin[2:4])/2
pos.text.x <- 7.5
text(pos.text.x, pos.text.y, features$description[2:4], cex=1, col=features$col[2:4])
text(pos.text.x, pos.text.y - 35, features$types[2:4], cex=1, col=features$col[2:4])
## Step 5: add protein domains using shading as detailed on Uniprot
# add more TNF receptor domains:
begin <- c(43, 83, 126, 167, 356)
subbegin <- 500 - begin
end <- c(82, 125, 166, 196, 441)
subend <- 500 - end
description <- c("TNFR-Cys 1", "TNFR-Cys 2", "TNFR-Cys 3", "TNFR-Cys 4", "Death"  )
types <- c("repeat", "repeat", "repeat", "repeat", "domain")
density <- c(10, 10, 10, 10, 5)  # this gives lines in the shading
angle <- c(45, 135, 45, 135, 15) # this gives angles of the lines

features <- data.frame(types, description, subbegin, subend, density, angle)
features$description <- as.character(features$description)

# make the rectangles in a loop
for (i in 1:length(features$types) ) {
  rect(xleft   = screen.width/2,
       ytop    = features$subbegin[i],
       ybottom = features$subend[i],
       xright  = screen.width/2+2.5,
       lwd = 2,
       density = features$density[i],
       angle = features$angle[i]      
       )
}

# add the descriptions of the features
x <- length(features$types)
pos.text.y <- features$subbegin[1:x] + (features$subend[1:x] - features$subbegin[1:x])/2
pos.text.x <- 19.5
text(pos.text.x, pos.text.y, paste(features$description[1:x], features$types[1:x]), cex=1)

# SCRIPT END

Wednesday, 16 December 2015

Counting cell nuclei in an image

I have been working with a colleague, Dr Joaquin de Navascues from the European Cancer Stem Cell Research Institute, to develop a workshop entitled an "Introduction to Biological Image Analysis". We plan to discuss the fundamentals of digital images and how to work with them in FIJI (Image J) and R, two open source data analysis tools. We aim to deliver this workshop during March or April next year at Cardiff University.

Joaquin is taking the lead on working with FIJI and I am developing the R material. For biological image analysis there is a useful package called EBImage. This has a nice introduction available and a detailed handbook (version 4.13.5).

I have been using this package to count cell nuclei in an image. The image is a microscope picture of a Drosophila gut. Counting nuclei involves mathematical transformations of the digital data. A digital image is a matrix of numbers.
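
Once an image is loaded (as in the script below), this is easy to see; a quick sketch:

# a quick sketch, assuming img1 has been loaded with readImage() as in the script below
dim(imageData(img1))       # the dimensions of the underlying array
imageData(img1)[1:3, 1:3]  # the top-left corner of the matrix: just numbers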

In non-technical language the key steps are:
  • blur the image 
  • apply a threshold to turn nuclei into 'blobs'
  • count the 'blobs'
The output from this script is:

Number of nuclei in this image = 92

The script below downloads an image from Github, opens the image, displays it, transforms it and then counts the nuclei. Because I plan to count nuclei in more than one image, I have made a function and then applied it to the downloaded file; a sketch for applying it to a whole folder of images appears after the script. Using user-defined functions to automate your workflow is a very good use of R. 

SCRIPT:

# to install use this:
# source("http://bioconductor.org/biocLite.R")
# biocLite("EBImage")


library(EBImage)  # you might need to install - see above

# the image is on Github 
# it is from a set of cells that are stained to detect the nuclei
# this is the link to the data

link <- "https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/seq/seq_z015_c003.tif"

# the download.file() function downloads and saves the file
download.file(url=link, destfile="file.tif", mode="wb")

# EBImage uses the readImage() function to load the file. 
img1 <- readImage("file.tif")

display(img1, method = "raster")  # shows the image within R. 


display(img1*4, method = "raster") # multiply the image to make brighter 




# I have written a function to count nuclei
# includes blurring the image, applying a threshold and counting....
# it displays the image as it is changed. 
# it's not perfect and overestimates the number of nuclei. 
# it's an example of what can be done. 
# improving and customizing the various options is very feasible. 

countNuclei <- function(img1){   
  # blur the image
  w = makeBrush(size = 11, shape = 'gaussian', sigma = 5)  # makes the blurring brush
  img_flo = filter2(img1*2, w) # apply the blurring filter
  display(img_flo * 4, method = "raster") # display the blurred image - brighter for display only. 
  


  # apply a threshold 
  nmaskt = thresh(img_flo *2, w=10, h=10, offset=0.05) 
  display(nmaskt, method = "raster")
  


  # the bwlabel() function 'counts' the blobs
  nucNo <- max(bwlabel(nmaskt))

  # this outputs the count to us
  cat('Number of nuclei in this image =', nucNo,'\n')  # reuse nucNo rather than recounting
  return(nucNo)
}

# this applies the function to the image
nucNo <- countNuclei(img1)

END OF SCRIPT