Wednesday 23 March 2016

As a biochemist, where would you like to work from an Athena SWAN point of view...

As part of my role in the School of Medicine at Cardiff University, I am helping to prepare an application for an Athena SWAN Award. As part of this, I have prepared a list of Athena SWAN Awards across the UK over the last few years and looking at other applications to learn more about the process. The data is on github. I have been using this data to find Medical Schools that have Athena SWAN awards to use as comparison data for our application and to use as a style guide.

Athena SWAN is a quality mark that focuses on equality. As such, departments and Institutions with Athena SWAN awards should be good places to work. Furthermore the better the award, the better the institution as a work place. I decided to analyse the Department names from the point of view of a biochemist.

Here is a graph, I produced (similar to one I produced before and published here):


To do the analysis, I made a user-defined function using function(). The function is called countAwardType(). To use the function, you put in an argument - the name of a department or institution in the brackets: countAwardType("Biochemistry").

There are a two Gold Awards for "Bio" Departments. These are:

There are three for Chemistry:
Oxford University has the most awards - see below for Russell group graph.


SCRIPT START
# As a biochemist, where would you like to work for from an Athena SWAN point of view?

library(RCurl)
library(readxl)

# this is the link to the data
link <- "https://raw.githubusercontent.com/brennanpincardiff/AthenaSWANBenchmarkData/master/ASAwards/AthenaSWANAwardList2012_201520160316.xlsx"

# the download.file() function downloads and saves the file with the name given
download.file(url=link, destfile="file.xlsx", mode="wb")

# read in the file with the read_excel() function
awards<- read_excel("file.xlsx", col_names = TRUE)

award.type <- c("Gold", "Silver", "Bronze")

# I have created a function to count the different types of Award - Gold, Silver & Bronze
# and then return a data.frame with specific names
countAwardType <- function(x){
  
  # to test this function  make x <- c("Medicine") and go through it line by line. 
  
  # list of the different types of Athena SWAN awards
  award.type <- c("Gold", "Silver", "Bronze")
  
  # find the pattern supplied in the function as x e.g. "Medicine"
  # this creates a data.frame. 
  x.Awards <- awards[grep(x, awards$org),]
  
  # use split() function which separates the data.frame into a list by award level
  # there is a data.frame within each list.
  # this can be called by the dollar sign - $. 
  x.split <- split(x.Awards, x.Awards$Level)
  
  # count the number of rows in each list 
  gold <- nrow(x.split$Gold)
  silver <- nrow(x.split$Silver)
  bronze <- nrow(x.split$Bronze)
  
  # if there are no entries then a NULL value will be returned 
  # zero is better therefore recode the NULL values
  if(is.null(gold) == TRUE) {gold <- 0}
  if(is.null(silver) == TRUE) {silver <- 0}
  if(is.null(bronze) == TRUE) {bronze <- 0}
  
  # combine these variables into a vector
  aW <- c(gold, silver, bronze)
  
  # combine the vectors into a data.frame
  awardsDepts.x <- data.frame(x, aW, award.type)
  
  # rename the columns
  colnames(awardsDepts.x) <- c("Dept.Type", "Award.Count", "Award.Type")
  
  # return the answer. 
  return(awardsDepts.x)
}

# with this function, I can answer the following questions....

# How many Departments of "Biochemistry" have Athena SWAN awards?
# I can answer this using the grep() function 
# this searches for the word "Biochemistry" in the org column from the awards data.frame
# it creates a data.frame called biochemAwards
biochemAwards <- awards[grep("Biochemistry", awards$org),]
nrow(biochemAwards)  # answer 5

# or
# I can use my function countAwardType()
countAwardType("Biochemistry")

# returns this output:
# Dept.Type Award.Count Award.Type
# 1 Biochemistry           0       Gold
# 2 Biochemistry           1     Silver
# 3 Biochemistry           4     Bronze


# As a biochemist you might be willing to work in any "Bio" department
# these include "Biochemistry" of course. 
# or a "Chem" Department 
# or a "Medic" Department like I do...
# or perhaps a Pharmacology department "Pharm"

# lets make a list. 

biochemistry.options <- c("Biochemistry", "Bio", "Chem", "Medic", "Pharm", "Immun")

# loop through the list - make a data.frame for plotting
awardsDepts <- NULL
for(i in 1:length(biochemistry.options)){
  awardsDepts <- rbind(awardsDepts, countAwardType(biochemistry.options[i])) 
}

# re-order the data.frame
awardsDepts$Award.Type <- factor(awardsDepts$Award.Type, levels = c("Bronze", "Silver", "Gold"))

# let's draw a graph
# make the first version of the plot
p <- ggplot(awardsDepts, 
            aes(x=Dept.Type, 
                y=Award.Count, 
                fill=Award.Type)) + 
            geom_bar(stat="identity") +  # makes the barplot
     scale_fill_manual(values=c("#956C3E", 
                                "#757576", 
                                "#A28D30")) + 
     # colours as per AS website - see below
     xlab("") +  # no need for the "Depts" x-axis
     theme_few() # nice clean theme  

# I want to add a y-axis label
# and increase the size of the text
p <- p + ylab("Number of Awards") +
  theme(axis.title.y = element_text(size = 14 )) + 
  theme(axis.text = element_text(size = 14))

# make the legend a bit bigger and move it
p <- p + theme(legend.position=c(0.9,0.85), # top right
               legend.text=element_text(size = 12), # inc title size
               legend.title=element_text(size = 12)) # labels

# have a look at the bar chart
p + ggtitle("Departments Types with Athena SWAN Awards")

# Don't forget some of the quality Research Institutes: 
# Babraham and Institue of Cancer Research both have Silver Awards. 


# by way of contrast let's look at some other department types:
departments <- c("Bio", "Chem", "Medic", "Engin", "Comp", "Math", "Psych", "Physic")
awardsDepts <- NULL
for(i in 1:length(departments)){
  awardsDepts <- rbind(awardsDepts, countAwardType(departments[i])) 
}
awardsDepts$Award.Type <- factor(awardsDepts$Award.Type, levels = c("Bronze", "Silver", "Gold"))

# push in the new data
d <- p %+% awardsDepts
d + ggtitle("Departments Types with Athena SWAN Awards")







# we can also compare Institutions
russell <- c("University of Birmingham", "University of Bristol", "University of Cambridge", 
             "Cardiff University", "Durham University", "University of Edinburgh",
             "University of Exeter", "University of Glasgow", "Imperial College London",
             "King’s College London", "University of Leeds", "University of Liverpool",
             "London School of Economics and Political Science", "University of Manchester",
             "Newcastle University", "University of Nottingham", "University of Oxford",
             "Queen Mary, University of London", "Queen’s University Belfast", 
             "University of Sheffield", "University of Southampton", "University College London",
             "University of Warwick", "University of York")

awardsDepts <- NULL
for(i in 1:length(russell)){
  awardsDepts <- rbind(awardsDepts, countAwardType(russell[i])) 
}
awardsDepts$Award.Type <- factor(awardsDepts$Award.Type, levels = c("Bronze", "Silver", "Gold"))

# push in the new data
inst <- p %+% awardsDepts
inst + ylim(0,50) +  # extend the y axis
    theme(legend.position=c(0.1,0.85)) +   # move the legend
    theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5 )) +                    # vertical names
    ggtitle("Russell group Universities with Athena SWAN Awards")




# finally Who's got Gold?
awards[grep("Gold", awards$Level),]
Type New_Renew Level Year Month                                                       org
# 70  Dept   Renewal  Gold 2015 Apr University of York, Department of Chemistry
# 360 Dept       N/A  Gold 2013 Nov Queen’s University Belfast, School of Psychology
# 361 Dept       N/A  Gold 2013 Nov University of Cambridge, Department of Physics
# 362 Dept       N/A  Gold 2013 Nov University of York, Department of Biology
# 452 Dept       N/A  Gold 2013 Apr Imperial College London, Department of Chemistry
# 539 Dept       New  Gold 2012   Nov Queen’s University Belfast, School of Biological Sciences
# 540 Dept       New  Gold 2012   Nov School of Chemistry, University of Edinburgh


# colours from AthenaSWAN website
# https://athena-swan.medschl.cam.ac.uk/wp-content/uploads/2014/03/Athena-SWAN_style-guide_October-2012.pdf
# Bronze: R149 G108 B62
# Silver: R117 G117 B118
# Gold: R162 G141 B48
# Convert RGB to hex http://www.rgbtohex.net/




Tuesday 8 March 2016

Plan for VizBi Tutorial

Updated: I have updated this after the tutorial

Today we went through three scripts rather than the full five, I had planned.
For participants, I wrote this script because of people who were interested in ggplot2. It talks about faceting which is an important part of ggplot2.
This other script is a proteomic example that talks about clustering and write a volcano plot.  
In response to questions, I have a couple of scripts that do relatively simple image analysis:

ggplot script for VizBi 2016

I have updated the ggplot script for VizBi.

It's based on Dean Hammond's previous script and includes a loop - often seen as bad practice in R but it seems to work here.

It makes this:




# START
# The plot with the six lines is a bit busy. 
# Let's make a ggplot equivalent to the base R multiplot.
# Here we do it using faceting.
# To make the script free standing the reshaping code is included. 

setwd("/Users/paulbrennan/Documents/RforBiochemistsScripts/VizBiMar2016")

library(reshape2)
library(ggplot2)
library(ggthemes)

# some data:
enzdata <- matrix(c(0, 17.36667, 31.97143, 52.68889, 61.95385, 74.2, 77.97143, 84.28, 99.91429, 93.66667, 
                    0, 15.7, 29.42286, 45.64, 62.60615, 75.78118, 69.88, 75.256, 89.59429, 86.84, 
                    0, 27.10667, 42.12, 63.48, 69.56, 74.26857, 79.44444, 83.29091, 87.1, 82.08571, 
                    0, 24.72, 39.07, 47.4, 57.928, 67.6, 71.35556, 67, 75.79375, 70.86667, 
                    0, 5.723636, 11.48, 17.697143, 28.813333, 37.567273, 42.483077, 40.68, 52.81, 56.92, 
                    0, 2.190476, 5.254545, 8.95, 15.628571, 20.8, 25.355556, 26.55, 32.44, 33.333333),
                  nrow = 10, ncol = 6)
no.Exp <- c("Exp.1", "Exp.2", "Exp.3", "Exp.4", "Exp.5", "Exp.6")
Sub <- c(0, 1, 2, 4, 8, 12, 16, 20, 30, 40)

enzdata <- as.data.frame(enzdata) # converted to a data.frame for plotting
colnames(enzdata) <- rep("Exp", times=6)  # add a column name
enzdata <- cbind(Sub, enzdata) # add the Substrate data

# so we have a data.frame and we want to draw some graphs. 
# as is always the case with R, there are many ways to do this. 

# using our dataframe we can draw the graphs one by one...

p <-  ggplot(data = enzdata, aes(x = Sub, y = Exp)) +
  geom_point() +
  theme_few() +
  ylab("velocity (nmol/s)") + xlab("substrate (mM)") +
  ggtitle("Experiment 1")
p

# one of the useful feature of ggplot is that you can force new data into an old plot
# we have the data from Exp 6 in an object. 
# if we make the same shaped dataframe with another set of data we can force that in.
new.data <- data.frame(enzdata$Sub, enzdata[,3])
colnames(new.data) <- c("Sub", "Exp")  # col names must be the same
p <- p %+% new.data   # this %+% is what forces in the new data...
p <- p + ggtitle("Experiment 2")
p


# we can use this feature and put making and saving the graphs inside a loop
for(i in 1:6) {  
  new.data <- data.frame(enzdata$Sub, enzdata[,i+1])
  colnames(new.data) <- c("Sub", "Exp")  # col names must be the same
  p <- p %+% new.data
  p <- p + ggtitle(paste("Experiment", i))
  filename <- paste0("Experiment_", i, ".pdf")
  p + ggsave(filename)
}
# it would probably be better style to write a function and vectorise it!


## there is also an 'even better' way to do it using facets - a useful feature of ggplot
# requires us to change the structure of the dataframe
# melt the data from wide to long:
colnames(enzdata) <- c("Sub", "Exp.1", "Exp.2", "Exp.3", "Exp.4", "Exp.5", "Exp.6")

melted_data <- melt(enzdata, id.vars = "Sub", value.name = "v", variable.name = "Exp")

# Check out the differences between these two files:
view(enzdata)
view(melted_data)

# the way ggplot draws colours depends on the required number of colours. The way it does it is just with equally spaced hues around the color wheel, starting from 15:
gg_color_hue <- function(n){
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
  }
  
# we have 6 experiments (so we want to create our own 6 colour "mini-palette")
cols <- gg_color_hue(6)  
# this gets the default ggplot colours when we want to plot 6 different variables.


# 1st, create an object with the ggplot 
# It has all the data simply as points, shaped and colour-coded by expt. of origin (as above). 
# We store it in an object called 'x':
x <- ggplot(data = melted_data, aes(x = Sub, y = v)) +
  geom_point(aes(colour = Exp, shape = Exp)) +
  theme_few() +
  ylab("velocity (nmol/s)") + xlab("substrate (mM)") +
  theme(axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        axis.text.x = element_text(size = 15),
        axis.text.y = element_text(size = 15),
        plot.title = element_blank(),            # we don't want individual plot titles as the facet "strip" will give us this
        legend.position = "none",                # we don't want a legend either
        panel.border = element_rect(fill = NA, color = "darkgrey", size = 1.25, linetype = "solid"),
        axis.ticks = element_line(colour = 'darkgrey', size = 1.25, linetype = 'solid'))     # here, I just alter to colour and thickness of the plot outline and tick marks. You generally have to do this when faceting, as well as alter the text sizes (= element_text() in theme also)

# have a quick look at x
x
# see all the data in different colours

# Next, let's modify 'x' by faceting based on the Experiment Name, 
# We also specify that we want a panel 3 plots wide x 2 plots high.
# You can change this, obviously. Can also 'facet_grid' too:
x <- x + facet_wrap( ~ Exp, ncol = 3)

# Finally, let's apply the Michaelis Menten fitting to our faceted data, adding the best fit line in each case:
x <- x + geom_smooth(method = "nls",                 
                     method.args = list(formula = y ~ Vmax * x / (Km + x), 
                                        start = list(Vmax = 50, Km = 2)),
                     se = F, colour = 'black', size = 0.5)

# Finally, just call 'x' to show the plots:
x

# if you want to save it as a .pdf, just add
ggsave("GGPLOT_FACET_EnzKin.pdf"), when you call 'x'
x + ggsave("GGPLOT_FACET_EnzKin.pdf")


Monday 7 March 2016

Gene Expression Analysis and Visualization for VizBi 2016 (Pt 2)

Updated: I have updated this script after the VisBi Tutorial. I have added some comments which I hope explain things better.

This second part of the gene analysis script generates list of genes that are differentially expressed following drug treatment. The data set is the same as used in first part of the analysis script. It's on github and can be downloaded within R. I like the volcano plots which facilitate a nice understanding of the scale of changes caused by the different drugs.





# START
library(RCurl)
library(ggplot2)
library(heatmap3)

# this script uses the limma pacakge from Bioconductor
# these first two lines install it. 
# for more information about the package, go here:
# https://bioconductor.org/packages/release/bioc/html/limma.html
# it was created to apply linear modelling to microarray data 
# but it can be used for any sets of numbers where you have multiple groups
source("https://bioconductor.org/biocLite.R")
biocLite("limma")
library(limma)


x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/microArrayData.tsv")
data <- read.table(text = x, header = TRUE, sep = "\t")

new.data <- data[2:16]
# these are log2 values 
# The data has been normalized and summarized

# turn into a matrix 
data.m <- as.matrix(new.data)

## calculate the significant differences limma!
names <- colnames(data.m)
names <- factor(gsub("\\.\\d", "", names)) # change names into factor

design <- model.matrix(~ 0 + names)
design
# this looks about right - each treatment has three samples
# no overlap between the samples. 

# Fit linear model for each gene given a series of arrays
fit <- lmFit(object = data.m, design = design)
fit

# set up the experiment using makeContrasts
# makeContrasts
contrast <- makeContrasts(
  Drug_A = namesUT - namesDrug_A,
  Drug_B = namesUT - namesDrug_B,
  Drug_C = namesUT - namesDrug_C,
  Drug_D = namesUT - namesDrug_D,
  levels = design)
contrast

fit <- contrasts.fit(fit, contrast)

# Given a microarray linear model fit, compute moderated t-statistics, 
# moderated F-statistic, and log-odds of differential expression by 
# empirical Bayes moderation of the standard errors towards a common value.
fit.bayes <- eBayes(fit)
fit.bayes

# using topTable to extract gene lists
# this method adjusts for multiple testing using the Benjamini and Hochberg method
# for information on this see the limma documentation
# https://bioconductor.org/packages/release/bioc/html/limma.html or
# there is lots of information on the internet
# e.g.: https://en.wikipedia.org/wiki/False_discovery_rate
Drug_A.diff <- topTable(fit.bayes, n=5000, adjust.method = "BH", coef=1)
Drug_B.diff <- topTable(fit.bayes, n=5000, adjust.method = "BH", coef=2)
Drug_C.diff <- topTable(fit.bayes, n=5000, adjust.method = "BH", coef=3)
Drug_D.diff <- topTable(fit.bayes, n=5000, adjust.method = "BH", coef=4)

# rownames of these are the row names of data...
colnames(Drug_D.diff)
str(Drug_D.diff)  # it's a data.frame

# VOLCANO PLOT VISUALISATIONS

# we can draw a volcano plat because we have calculated the log 2 fold change and the p-value
Drug_A.diff$threshold = as.factor(Drug_A.diff$adj.P.Val < 0.01)
Drug_B.diff$threshold = as.factor(Drug_B.diff$adj.P.Val < 0.01)
Drug_C.diff$threshold = as.factor(Drug_C.diff$adj.P.Val < 0.01)
Drug_D.diff$threshold = as.factor(Drug_D.diff$adj.P.Val < 0.01)

##Construct the plot object
gA <- ggplot(data=Drug_A.diff, 
            aes(x=logFC, y =-log10(adj.P.Val), 
                colour=threshold)) +
  geom_point(alpha=0.4, size=1.75) +
  xlab("log2 fold change") + ylab("-log10 p-value") +
  theme_bw() +
  theme(legend.position="none")

gA <- gA + ggtitle("Drug A")
gA     # shows us the object - the graph

# this %+% allows us to push a different dataset into an existing 
# ggplot object. It overwrites the data
# it must contain the same column names otherwise it won't work.
gB <- gA %+% Drug_B.diff

# then we need to give it a new title
gB <- gB + ggtitle("Drug B")

# do the same for the other Drugs
gC <- gA %+% Drug_C.diff
gC <- gC + ggtitle("Drug C")
gD <- gA %+% Drug_D.diff
gD <- gD + ggtitle("Drug D")

# These visualisations work.
# no significant differences with Drug A, more with Drug B and C
# lots of differences with Drug D.

# run code below to make multiplot function first!
multiplot(gA, gB, gC, gD, cols=2)



## ANOTHER VISUALISATION: a heatmap

# merge Drug_D adjusted P value and LogFC with expression values...
Drug_D.diff$probeset_id <- rownames(Drug_D.diff)
merged.data <- merge(data, Drug_D.diff, by="probeset_id")

# subset merged data for significantly altered genes
data.2.heatmap <- subset(merged.data, adj.P.Val < 0.001)
dim(data.2.heatmap)

# get the numbers and turn into a matrix
data.2.heatmap.num <- data.2.heatmap[2:16]  # just the numbers
data.2.heatmap.m <- as.matrix(data.2.heatmap.num)

# if you make your plot window big, this will work
heatmap3(data.2.heatmap.m,
         RowSideLabs = FALSE, 
         showRowDendro=FALSE,
         showColDendro=FALSE,
         main = "Heatmap of Drug Effects")

# otherwise print it to a PDF file. 
pdf("mygraph.pdf")
heatmap3(data.2.heatmap.m,
         RowSideLabs = FALSE, 
         showRowDendro=FALSE,
         showColDendro=FALSE,
         main = "Heatmap of Drug Effects")
dev.off()


# It's not perfect but it's the best I can do at the moment.




# from: http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Gene Expression Analysis and Visualization for VizBi 2016 (Pt 1)

UPDATE: During the VizBi2016 tutorial session, the participants noticed a couple of errors in the script. I think these have now been corrected.
Also, Christian Hauer suggested that we visualise the distance matrix using the heatmap packages. He suggested pheatmap, in particular. Thanks Christian.

I have spent some time creating two quite long scripts that generate a selection of visualisation of some gene expression data generated recently by Mel Boyd at the Cardiff CLL Research Group. The script illustrates a workflow that results in a selection of interesting and useful visualisations that allow us to compare the different drug treatments used.

The data was generated from an Affymetrix HTA 2.0 chip. The data has been normalised and mapped using the oligo package. I have focussed on the protein coding transcripts and I have generated a random selection of 5000 of these genes.

This part of the experiment uses the data from four different experimental 'drugs' which target different pathways and have different cellular effects. This analysis and the associated visualisations are used to allow us to compare the drug treatments and look at the data as a whole set.

I am happiest with the visualisation that is generated by the loadings of the principal component analysis but there are some other good visualisations generated by the scripts too.
The cluster at the top right is from the drug that causes the most changes. The other drugs cause few changes and so map closer to the untreated samples.

Next script shows volcano plots and a heatmap.


Here is the script:
# START 
# download the data from github
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/microArrayData.tsv")
data <- read.table(text = x, header = TRUE, sep = "\t")

# basic commands for looking at the data
head(data)      # view first six rows
str(data)       # shows it's a dataframe
colnames(data)  # column names
View(data)      # this works in R-Studio

# for first part of analysis
# subset of the data with just the numbers.
new.data <- data[2:16]
# these are log2 values 
# The data has been normalized


## FIRST visualisation of the data: a simple box plot
boxplot(new.data, 
        las =2, # las = 2 turns the text around to show sample names
        ylab = expression(bold("expression")),
        main="Boxplot of expression data")  




## SECOND visualisation(s): calculate and visualise distances
# turn into a matrix 
data.m <- as.matrix(new.data)
# transpose the data because a distance matrix works in rows
data.m.t <- t(data.m)
# calculate the distances and put the calculations into an object called distances
distances <- dist(data.m.t)
# convert this distances object into a matrix. 
distances.m <- data.matrix(distances)
# you can look at this object.
View(distances.m)
# we can extract the size of the object and the titles
dim <- ncol(distances.m)
names <- row.names(distances.m)
# now to create the visualisation of the difference matrix. 
# first the coloured boxes
image(1:dim, 1:dim, distances.m, axes = FALSE, xlab = "", ylab = "")
# now label the axis
axis(3, 1:dim, names, cex.axis = 0.8, las=3)
axis(2, 1:dim, names, cex.axis = 0.8, las=1)
# add the values of the differences
text(expand.grid(1:dim, 1:dim), sprintf("%0.1f", distances.m), cex=1)





# heatmaps ccan also be used to visualise the distance matrix. 
# samples can be clustered or you can stop this. Thanks Christian Hauer
heatmap(distances.m)


heatmap3(distances.m)



pheatmap(distances.m)



## THIRD visualisation: use the distance matrix to do some clustering
plot(hclust(distances))


# three clusters - based on the distance matrix and shows the same thing really.

## FOURTH visualisation: principal component analysis
# Do a Principal Component Analysis  
# and widely-used technique for viewing patterns of gross variation
# in datasets
# We can see whether an array is substantially different to the others
pca <- princomp(data.m)  # function that does the PCA
summary(pca)  # one component accounts for 99% of the variance
plot(pca, type = "l")  




names <- factor(gsub("\\.\\d", "", names)) # change names into factor
plot(pca$loadings, col = names, pch = 19)
text(pca$loadings, cex = 0.7, label = colnames(new.data), pos =3)



# some of the treatment cluster well together (e.g. Drug D) others not so much


# can be interesting to look at a subset of the data...
pca <- princomp(data.m[,1:12])
plot(pca$loadings, col = names, pch = 19, 
     main = "Just three drug treatments")
text(pca$loadings, cex = 0.7, label = colnames(new.data)[1:12], pos =3)

# basically happy with our data generally. 
# next step in next script is to look at differentially expressed transcripts.