Showing posts with label publication. Show all posts
Showing posts with label publication. Show all posts

Thursday, 30 September 2021

Data visualization with the programming language R - published.

Today, I published a piece in The Biochemist about data visualisation with R. I enjoyed writing the piece and it includes some of the code and blog posts that I have written here. 

The code I used to create the figures has all been uploaded on Github. To reproduce the figures you can cut and paste the code from there into a script window on R-Studio. 

Here is a list of pages with code for the figures. 

Figure 1 - R allows reproducible data visualisations. 


Figure 2 - Three most viewed data viz from this blog

Figure 3 - Tidy Tuesday data visualisations
  • Figure 3A - A volcanic activity time line inspired by @ijeamaka_a
  • Figure 3B - Illustrating the importance of numbers in password strength. Across a range of password types, inclusion of numbers increases password strength.
  • Figure 3C (Github only) - Showing the proportion of female culprits in Scooby Doo shows from 1960s to 2020s.
Figure 4 - Showcasing drawProteins (below)



There are lots more scripts on this blog. I hope you find it useful for learning R. 


Do reach out if you need any help or if some of the code doesn't work. Either comment here, contact me through Github or twitter: @brennanpcardiff


Friday, 20 January 2017

Colouring my kinome phylogenetic tree...

I have been playing with the kinome phylogenetic tree that I made last week inspired by the paper in Science from Manning et al in 2002. The first thing I want to do is to add color to the lines. This has proved a bit more difficult than I had anticipated and was a steep learning curve. It reminds me that data visualisation is a real combination of art, science, coding, time and effort.

Here is a coloured kinome tree that I am kind of happy with for the moment. I think there is some improving still to do but it's nicer than the one coloured version I made before.




The code below uses a very manual approach to manipulating trees. There are various reasons for that and these include the fact that I want to make an unrooted tree. There are ways to manipulate trees using other packages including phytools and ggtree. However, I'm not able to get them to work well with unrooted trees. There is still more to learn....

Here is the code that I used to draw this and some of the trees that I made along the way:

START
library(ape)
library(RCurl)

# a tree was made using code in a previous blog post. 
# see here: http://rforbiochemists.blogspot.co.uk/2017/01/visualizing-kinome-in-r-simple-tree.html
# it's here on github
link <- "https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/phyloTrees/kinaseTree_20161221"

# this will download it into R. 
# using read.tree() from ape package
tree <- read.tree(file = link)

# this tree looks quite nice in my opinion and is the starting point of this blog 
plot(tree, "u", 
     use.edge.length = FALSE,
     show.tip.label = FALSE,
     edge.color = "red")
# but lacking colours for the groups and any labels...




# it seems more traditional to draw trees from left to right. 
plot(tree, 
     use.edge.length = FALSE,
     show.tip.label = FALSE,
     edge.color = "red")

# with some names... this is slow to draw due to the names 
plot(tree, 
     use.edge.length = FALSE,
     edge.color = "red",
     cex = 0.25)


# and you can't read the names because there are >500




# to customise this tree in a way we want we need to understand a little more about trees
# we can find out more about an object by writing the name
tree
# "Phylogenetic tree with 516 tips and 514 internal nodes"

# by using the class() function
class(tree)
# "phylo"

# or by using the str() structure function
str(tree) 
# "List of 4"
# this list includes $edge, $Nnode, $ tip.label and $edge.length
# the tree$tip.label includes family designation
tree$tip.label  # 516 of these

# from the Science paper, we have seven kinase families:
# kinase categories... TK, TKL, STE, CK1, AGC, CAMK, CMGC
# with the following colours
# "red", "green", "paleblue", "orange", "yellow", "purple", "pink", "green" 


# by using the grep()function on the tree$tip.label part of the object
# we can find the tip labels that include "TK/" - i.e. tyrosine kinases
grep("TK/", tree$tip.label)  # gives a list of numbers with "TK/" in tip label
length(grep("TK/", tree$tip.label))
# thus there are 94 tip labels with that are designated TK (not TKL tyrosine kinase like)

# make a vector for each tip.label called tipcol with black on all of these...
tipcol <- rep('black', length(tree$tip.label))

# make a vector with our list of kinase categories
kinaseCats <- c("TK/", "TKL", "STE", "CK1", "AGC", "CAMK", "CMGC", "RGC")

# make a vector of color we want:
colorsList <-c("red", "darkolivegreen3", "blue", "orange", "yellow", "purple", "pink", "green")

# replace colours where grep gives "TK" as red, etc in a loop
for(i in 1:length(kinaseCats)){
  tipcol[grep(kinaseCats[i], tree$tip.label)] <- colorsList[i]
}

# plot with edge length false to see nodes better
plot(tree, 
     use.edge.length = FALSE,
     tip.color=tipcol, 
     cex = 0.25)
# slow to draw due to text - a bit annoying!


Kinome tree with different coloured labels for different kinds of kinases. Tyrosine kinases are in red. 

# trees are made up of nodes and edges. 
# its possible to label nodes using nodelabels() function from ape package
nodelabels(cex=0.4)
# labels internal nodes. 
Internal nodes of the tree are labelled with the number

# the only way seems to identify the relevant nodes manually

# i.e. the nodes that include the kinase groups that we have coloured
# from the bottom
# for 1st "green" looks like node 574
# for "red" looks like node 607
# for 2nd "green" somethink like 701 but very difficult to see
# for "purple" most of 749 and also north of 726 but I can't read the number
# "blue" node 723
# "yellow" node 885, I think
# "pink" node 955, I think
#  not perfect but getting there....

# adding edge colors
# from 111 to 177 should be green
# from 178 to 364 should be red. 
# from 459 to 577 should be purple
# from 578 to 608 should be purple too
# from 641 to 733 should be blue
# from 735 to 850 approx should be yellow
# from 876 to 980 should be pink
# http://stackoverflow.com/questions/34089242/phylogenetic-tree-tip-color
# make a vector for each edge called edgecol with black on all of these...
edgecol <- rep('black', nrow(tree$edge))
edgecol[178:364] <- "red" # "TK/"
edgecol[111:177] <- "green" # "TKL" OR "RGC"
edgecol[641:733] <- "blue" # "STE"
edgecol[1003:1029] <- "orange" # "CK1"
edgecol[735:850] <- "yellow" # "AGC"
edgecol[459:577] <- "purple" # "CAMK"
edgecol[578:608] <- "purple" # "CAMK"
edgecol[876:980] <- "pink" # "CMGC"


plot(tree,
     use.edge.length = FALSE,
     tip.color=tipcol, 
     edge.color = edgecol,
     cex = 0.25)


Kinome Tree with text and branches coloured. 





plot(tree, "u",
     use.edge.length = FALSE,
     tip.color=tipcol, 
     edge.color = edgecol,
     cex = 0.25)


# plot.phylo() function from ape package allows rotation of tree. 
plot.phylo(tree, "u", 
           use.edge.length = FALSE,
           edge.color = edgecol,
           rotate.tree = -95,
           show.tip.label = FALSE)



# want to add some names at the ends of the branches
# try to find out some tip numbers using
tiplabels(cex=0.3)



# add labels to node 246, 105, 191 and 340
tree$tip.label[246] # "CaMK1d_Hsap_-CAMK/CAMK1"
tree$tip.label[105] # "EphA3_Hsap_-TK/Eph"
tree$tip.label[191] # "TGFbR1_Hsap_-TKL/STKR/Type1"
tree$tip.label[340] # "TNIK_Hsap_-STE/STE20/MSN"

# add these to a list
kinaseLabels <- c("IKKa","ErbB2", "MLK1", "PKCb", 
                  "CDK9", "CaMK1d_Hsap_-CAMK/CAMK1",
                  "EphA3_Hsap_-TK/Eph", "TGFbR1_Hsap_-TKL/STKR/Type1",
                  "TNIK_Hsap_-STE/STE20/MSN")

# extract tip.labels
tipLabels <- tree$tip.label

# find these in the alignment - they will be tip labels
labelNo <- NULL
for(i in 1:length(kinaseLabels)){
  labelNo <- c(labelNo, grep(kinaseLabels[i], tree$tip.label))
}
# generates a vector of 9 numbers. Some labels in two names. 

# make a vector of blank tiplabels
tipLabels_2 <-  rep('', length(tree$tip.label))

# add the labels we want to the vector...
for(i in 1:length(labelNo)){
  tipLabels_2[labelNo[i]] <- tipLabels[labelNo[i]]
}

# make a new tree
tree_fewLabels <- tree

# replace the tip labels with the shorter list.
tree_fewLabels$tip.label <- tipLabels_2

# remove "Hsap" 
tipLabels_2 <- gsub("Hsap", "", tipLabels_2)
tree_fewLabels$tip.label <- tipLabels_2

# plot.phylo() function from ape package allows rotation of tree. 
plot.phylo(tree_fewLabels, "u", 
           use.edge.length = FALSE,
           edge.color = edgecol,
           rotate.tree = -95,
           show.tip.label = TRUE,
           cex = 0.4)





# add a title and source
plot.phylo(tree_fewLabels, "u",
     main="Phylogenetic tree of human kinase domains",
     sub="source: www.kinase.com & Manning et al Science (2002) 398:1912-1934",
     rotate.tree = -95,
     use.edge.length = FALSE,
     edge.color = edgecol,
     show.tip.label = TRUE, font = 2,
     cex = 0.5)




# this looks quite good and is enough for today. 







Friday, 6 January 2017

Visualizing the kinome in R - a 'simple' tree...

A paper by Manning et al in Science in 2002 showed an phylogenetic tree of the kinases in the human genome. With the help of a beautiful poster and web resources by Cell Signaling Technologies, this visualisation has become a classic among researchers working on protein kinases.

It has been used in many ways and the paper itself has be very extensively cited.
Examples include:
This blog post uses the sequences available on the protein kinase website (kinase.com) to create a tree using R. My hope is that this will be the first of few posts that will develop into being able to reproduce the visualisation from the Science and to render a similar image to that created by Cell Signaling Technology. This is really a key objective for me as I want to use this visualisation for my teaching.

If you are interested, please test the code and make comments. This is an ongoing project and I welcome feedback.

So here is the first version of the kinome phylogenetic tree:




I have made a more simple phylogenetic tree of the human proteins with the rel homology domain here which talks about the complexity of making trees.



Here is the script that makes this:
SCRIPT START

library(seqinr)  
library(msa)
library(ape)

# first version of generating a the kinome visualisation in R
# data is here: http://kinase.com/kinbase/FastaFiles/Human_kinase_domain.fasta
# in fasta format...
# need to extract the data into R... 
# need Biostrings package - downloaded as part of seqinr package

file <- c("http://kinase.com/kinbase/FastaFiles/Human_kinase_domain.fasta")

# step 1 is read in the FASTA files. 
kinases <- readAAStringSet(file, format = "fasta")
# that seems to work. 
kinases
# 516 sequences. 
# good.  

# step 2 do the multiple sequence alignment
kinaseAlign <- msa(kinases)  
# takes a bit of time!  - about 3.5 min on my computer...
# currently using default substitution matrix and CLUSTALW 
# creates an object with Formal class 'MsaAAMultipleAlignment' [package "msa"] with 6 slots

# step 3: convert Msa Alignment object into alignment for seqinr 
kinaseAlign2 <- msaConvert(kinaseAlign, type="seqinr::alignment")
class(kinaseAlign2) # it's an alignment
# worked 
# List of 4

# step 4: compute distance matrix - dist.alignment() function from the seqinr package:
d <- dist.alignment(kinaseAlign2, "identity")
# Class 'dist'

kinaseTree <- nj(d)  # from ape package, I think...
class(kinaseTree)
# class "phylo"
# List of 4

# good idea to save the tree locally... remove comment symbol
# write.tree(kinaseTree, file = "kinaseTree")
# to read back in:
# kinaseTree <- read.tree(file = "kinaseTree")

plot(kinaseTree, 
     main="Phylogenetic Tree of kinases")
# too difficult to read so remove the tip.labels which are the kinase names.

plot(kinaseTree, 
     main= "Phylogenetic Tree of kinases", 
     show.tip.label = FALSE)

plot(kinaseTree, 
     type = "unrooted", 
     main= "Phylogenetic Tree of kinases", 
     show.tip.label = FALSE)



# looks quite stylish a a little similar to visualisation in the Science paper
plot(kinaseTree, "u", 
     use.edge.length = FALSE,
     show.tip.label = FALSE)



# need to add colour
# argument is edge.color
plot(kinaseTree, "u", 
     use.edge.length = FALSE,
     show.tip.label = FALSE,
     edge.color = "red")





# want to add selected labels to give some orientation 
# extract tip.labels
tipLabels <- kinaseTree$tip.label

# add some labels we like to orient ourselves:
kinaseLabels <- c("IKKa","JAK3","ErbB2", 
                  "NEK11", "MLK1", "PKCb", 
                  "CDK9", "FRAP")

# find these in the alignment - they will be tip labels
labelNo <- NULL
for(i in 1:length(kinaseLabels)){
  labelNo <- c(labelNo, grep(kinaseLabels[i], kinaseTree$tip.label))
}
# generates a vector of 9 numbers. Some labels in two names. 

# make a vector of blank tiplabels
tipLabels_2 <-  rep('', length(kinaseTree$tip.label))

# add the labels we want to the vector...
for(i in 1:length(labelNo)){
  tipLabels_2[labelNo[i]] <- tipLabels[labelNo[i]]
}

# make a new tree
kinaseTree_fewLabels <- kinaseTree

# replace the tip labels with the shorter list.
kinaseTree_fewLabels$tip.label <- tipLabels_2

# make the plot with these
plot(kinaseTree_fewLabels, "u", 
     use.edge.length = FALSE,
     show.tip.label = TRUE,
     edge.color = "red",
     cex = 0.7)







# remove "Hsap" using gsub() function 
tipLabels_2 <- gsub("Hsap", "", tipLabels_2)
kinaseTree_fewLabels$tip.label <- tipLabels_2
plot(kinaseTree_fewLabels, "u", 
     use.edge.length = FALSE,
     show.tip.label = TRUE,
     edge.color = "red",
     cex = 0.7)





# add a title and source and you have the image at the top...
plot(kinaseTree_fewLabels, "u",
     main="Phylogenetic tree of human kinase domains",
     sub="source: www.kinase.com & Manning et al Science (2002) 398:1912-1934",
     use.edge.length = FALSE,
     show.tip.label = TRUE,
     edge.color = "red",
     cex = 0.7)

# this looks quite good and is enough for today. 


Friday, 11 September 2015

Downloading and manipulating published proteomic data...

Update: 1 July 2025 - so a lot has happened in 10 years. This includes people moving jobs, promotions and the reorganisation of data on published website. 

Aled was promoted to Professor at Cardiff University 

----

There are many ways to get data into R. I want to illustrate a method of downloading published data within R, opening the data (an Excel file) and then doing visualisations and manipulations. 

I have chosen a paper from Molecular and Cellular Proteomics which uses aptamers to detect multiple proteins in exosomes and cells. 

The data was generated by colleagues that were working in the School of Medicine at Cardiff University including the first author - Dr Jason Webber, a Prostate Cancer UK funded Research Fellow and senior author, Professor Aled Clayton, a Senior Lecturer in Cancer & Genetics at Velindre Hospital. Dr Tim Stone was key to the data analysis. The protein detection method is from a company called SomaLogic

The first step was downloading the file from the Molecular and Cellular Proteomics website. I used the download.file() function. This saves the file into your current directory. This was opened using the readxl package (by Hadley Wickham) using the read_excel() function. 

I drew some graphs as I explored and manipulated the data. These are interspersed with the script below.  The visualisations included boxplots and a cluster diagram. 

I wanted to draw a volcano plot which expresses the fold change against the significant of the change (p-value). I couldn't do that with the data supplied so I had to reverse the transformation and calculate the fold change again. 

Here is the volcano plot:
Comparison of changes in exosomes compared cells. Proteins over-expressed in exosomes are on the right. Proteins over-expressed in cells are on the left. 

The plot indicates that there are more proteins over-expressed in cells (on the left) compared to exosomes (on the right). 

Update: 1 July 2025 - orignally, I was able to download the data directly from the Molecular and Cellular Proteomics website. However, at some point, they reorganised their site and put all the data into a zip file. This means that downloading it requires multiple steps outside of R. To make this analysis more stand alone, I have down loaded the data from this zip file and uploaded the spreadsheet onto my Github site. This means that the script will download and analyse the data. 

Here is the script with some other plots along the way:

START
# pull down a file from the internet, do some analysis and draw a graph...
# choose Jason Webber's MCP Paper...
# the data can be downloaded using this link with give zip file. 
# It isn't necessary to do that for this script. 
# install if necessary:
# install.packages(c("ggplot2", "readxl")) 
library(ggplot2)
library(readxl)


# this is the link to the data
link <- "https://github.com/brennanpincardiff/RforBiochemists/raw/master/R_for_Biochemists_101/data/mcp.M113.032136-6.xlsx"

# the download.file() function downloads and saves the file with the name given
download.file(url=link,destfile="file.xlsx", mode="wb")

# then we can open the file and extract the data using the read_excel() function. 
data<- read_excel("file.xlsx")

View(data)

# plot the data - always an important first step!
boxplot(data[2:7], 
        las =2, # las = 2 turns the text around to show sample names
        ylab = expression(bold("expression")),
        main="Boxplot of expression data")  


Two types of sample - exosomes (n=3) and cells (n=3).

# do a cluster analysis to quality control the different groups
# convert to matrix first
data.m <- as.matrix(data[2:7])
dim(data.m)   # gives the dimensions of the matrix
# ans: 762   6

# calculate the distances using the dist() function. 
# various methods are possible - default is Euclidean. 
distances <- dist(data.m)
summary(distances) # have a look at the object

# make the cluster dendrogram object using the hclust() function
hc <- hclust(distances) 
# plot the cluster diagram
# some interesting groups in the data
plot(hc, 
     xlab =expression(bold("All Samples")), 
     ylab = expression(bold("Distance")))



# ah, not what I intended. 
# it clustered the proteins NOT the samples. 

# transpose the data and try again...
data.m.t <- t(data.m)
dim(data.m.t)
# ans = 6  762 - so that has worked. 
# repeat cluter analysis

# calculate the distances using the dist() function. 
# various methods are possible - default is Euclidean. 
distances <- dist(data.m.t)
summary(distances)

# make the cluster dendrogram object using the hclust() function
hc <- hclust(distances) 
# plot the cluster diagram
# some interesting groups in the data
plot(hc, 
     xlab =expression(bold("All Samples")), 
     ylab = expression(bold("Distance")))
# replicates cluster together well. 





# the adjusted P value are in a column entiteld: BH - P.value
# this is a little awkward so rename this and Fold Change column:
colnames(data)[9] <- "P.Value"
colnames(data)[10] <- "Fold.Change"
plot(data$P.Value ~ data$Fold.Change)



# this works but we would like to turn it into a volcano plot 
# with log2 at the bottom and -p-value. 
# we can't log fold changes as half of them are negative numbers 
# we need to re-calculate the raw data
# reverse the log2 transformation. 
# mean the values & calculate fold change in decimal format (no +/-)
data$exo1 <- 2^data$exoRFU1
data$exo2 <- 2^data$exoRFU2
data$exo3 <- 2^data$exoRFU3
data$cell1 <- 2^data$cellRFU1
data$cell2 <- 2^data$cellRFU2
data$cell3 <- 2^data$cellRFU3

# calculate means of the replicates
data$exoMean <- rowMeans(data[,13:15])
data$cellMean <- rowMeans(data[,16:18])

# always good to visualise the data:
plot(log2(data$exoMean)~log2(data$cellMean))


# calculate fold change exo/cell
data$FoldChange <- data$exoMean/data$cellMean
plot(log2(data$FoldChange)) # transform for plotting


# make column in data.frame with transformed data
data$Log2.Fold.Change <- log2(data$FoldChange)

## Identify the genes that have a p-value < 0.05
data$threshold = as.factor(data$P.Value < 0.05)


## Construct the volcano plot object using ggplot
g <- ggplot(data=data, 
            aes(x=Log2.Fold.Change, y =-log10(P.Value), 
                colour=threshold)) +
  geom_point(alpha=0.4, size=1.75) +
  xlim(c(-6, 6)) +
  xlab("log2 fold change") + ylab("-log10 p-value") +
  theme_bw() +
  theme(legend.position="none") + 
  ggtitle("Volcano Plot comparing protein expression in exosomes vs cells ")  # add a title
  
g # show the plot 

END of script

So I think that the threshold of p<0.05 is too low for this volcano plot. It's relatively easy to change and would make a good exercise. Perhaps a threshold of p<0.00001 would be better. 




Useful resources (updated 1 July 2025)