A site to help Biochemists learn R.

Starting points

Friday, 22 June 2018

Creating a drawProteins hex sticker...

Lots of Bioconductor packages have created hex stickers - those fun stickers that go on the top of your laptop. They are a bit nerdy but, hey, who doesn't like nerdy 😀

I thought I should create a hex sticker for drawProteins since it's now available in Bioconductor.

It's been created totally in R using the hexSticker package from the talented GuangchuangYu.

Here is the sticker:


Below is the R code that makes the sticker. I've chosen to show three of NF-kappaB proteins but the code could be adapted to illustrate your favourite protein. You can also customise the sticker colour and lots of other things.

START
# script to create drawProteins sticker
# setwd("/Users/paulbrennan/Documents/drawProteins_hexsticker")
library(ggplot2)
# install.packages("hexSticker")
library(hexSticker)
library(drawProteins)

data("five_rel_data")
p <- draw_canvas(five_rel_data)
p <- draw_chains(p, five_rel_data, label_chains = FALSE, size =0.2)
p <- draw_domains(p, five_rel_data,
  show.legend = FALSE,
  label_size = 2)
p

p <- draw_phospho(p, five_rel_data)
p <- p  + theme_void()

sticker(p, package="drawProteins", p_size=6, s_x=.98,
  s_y=.88, s_width=1.4, s_height=1.1,h_fill = "#114466", h_color="#001030")


# this doesn't quite work :-(
# size of text is wrong... no labels... cramped!
# I think five proteins is probably too many...

# reduce to three....
# P19838
# Q04206
# Q04864

# get from UniProt rather than using internal data
prot_data <- drawProteins::get_features("Q04206 Q04864 P19838")

# turn the data into a dataframe
prot_data <- drawProteins::feature_to_dataframe(prot_data)

# create canvas
p <- draw_canvas(prot_data)

# customise the labels
p <- draw_chains(p, prot_data, label_size = 1.4,
  labels = c("c-Rel",
    "p50/p105",
    "p50/p105",
    "p65/Rel A"),
  size=0.2)

# draw domains without a legend or labels
p <- draw_domains(p, prot_data, show.legend = FALSE, label_domains = FALSE)

# add text and control size with geom_text rather than labels from drawProteins
p <- p + ggplot2::geom_text(data = prot_data[prot_data$type == "DOMAIN", ],
  ggplot2::aes(x = begin + (end-begin)/2,
               y = order,
               label = description),
           size = 1.3)

# keep phosphorylation sites small
p <- draw_phospho(p, prot_data, size = 1)

# a completely blank theme using theme_void()
p <- p  + theme_void()
p

# make the sticker using the ggplot object entitled p
# creates a png file
sticker(p,
  package="drawProteins",
  p_size=6,  # font size for package name
  s_x=.95,   # x position for ggplot object
  s_y=.86,   # y position for ggplot object
  s_width=1.4,
  s_height=1.1,
  h_fill = "#e8a935",   # background colour for hexagon
  h_color="#001030",    # border colour
  url = "www.bioconductor.org")
# automatically creates the PNG file with the name drawProteins.


# make the PDF
sticker(p,
  package="drawProteins",
  p_size=6,  # font size for package name
  s_x=.95,   # x position for ggplot object
  s_y=.86,   # y position for ggplot object
  s_width=1.4,
  s_height=1.1,
  h_fill = "#e8a935",   # background colour for hexagon
  h_color="#001030",    # border colour
  url = "www.bioconductor.org",
  filename="drawProteins.pdf")

END


Some Resources:

For more help, bug reports or to suggest features


Tuesday, 13 February 2018

Plotting disease causing myosin variants using drawProteins

My drawProteins package has been accepted by Bioconductor. It's currently available as a development version -  a nice conclusion of lots of effort.

Last week, I had my first query about drawProteins from Kevin T. Booth, a graduate student at the University of Iowa. Kevin wanted to plot disease causing protein variants.

Kevin's working on the MYO7A gene (Uniprot ID “Q13402”). Kevin told me that "alterations in MYO7A can result in 3 different phenotypes-Autosomal Dominant non-syndromic hearing loss (DFNA11),  Autosomal Recessive non-syndromic hearing loss (DFNB2) or Usher Syndrome Type 1 (USH1B), another type of inherited deafness.

Looking at the Uniprot page for MYO7A gene, it contains a long list of disease associated variants. We created a visualisation of these variants using the geom_point() function to apply a ggplot2 layer to a schematic of the protein.




Here is the script that generated this and a few other images on the way. Feedback welcome.

SCRIPT START
# exploring plotting variants
# using data from Q13402

# download the protein data first...

library(drawProteins)
library(ggplot2)
# pulling down data for Kevin's protein - MYO7A
prot_data <- drawProteins::get_features("Q13402")
# produce data frame
prot_data <- drawProteins::feature_to_dataframe(prot_data)

# make protein schematic
p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data)
p <- draw_domains(p, prot_data)
p





# nice domain structure
# note that it is a long protein > 2000 amino acids. 

p <- p + theme_bw(base_size = 14) + # white background
    theme(panel.grid.minor=element_blank(), 
        panel.grid.major=element_blank()) +
    theme(axis.ticks = element_blank(), 
        axis.text.y = element_blank()) +
    theme(panel.border = element_blank())
p




# EXPLORE VARIANTS
# many VARIANTS
# count them...
nrow(prot_data[prot_data$type == "VARIANT",])
# ans = 74

# these can be plotted using geom_point() function
p + geom_point(data = prot_data[prot_data$type == "VARIANT",],
    aes(x = begin,
        y = order+0.25))


# wanted focus on disease variants and colour the by disease...
# so need to change the info...
# colour or shape by info within it...

# duplicate prot_data
prot_data2 <- prot_data
# duplicate description..
prot_data2$disc2 <- prot_data$description
# remove word "in"
prot_data2$disc2 <- gsub("in ", "", prot_data2$disc2)

# pull out the disease name...
library(stringr)
temptable <- str_split_fixed(prot_data2$disc2, ";|:", 2)
prot_data2 <- cbind(prot_data2, temptable)
# name columns
tempnames <- colnames(prot_data2)
tempnames[11] <- "disease"
colnames(prot_data2) <- tempnames

# add colour argument within aesthetics
p + geom_point(data = prot_data2[prot_data2$type == "VARIANT",],
    aes(x = begin,
        y = order+0.25,
        colour = disease))



# exclude some of the VARIANTS as not disease causing..
# dbSNPs, rare polymorphism and the blank one 
prot_data2$disease[prot_data2$disease=="dbSNP"] <- NA
prot_data2$disease[prot_data2$disease==""] <- NA
prot_data2$disease[prot_data2$disease=="rare polymorphism"] <- NA

# clean up variant list for easier legend
disease_var_data <- prot_data2[prot_data2$type == "VARIANT",]
disease_var_data <- na.omit(disease_var_data)
disease_var_data$disease <- gsub("found patients with ", 
    "", disease_var_data$disease)
disease_var_data$disease <- gsub("found a patient with ", 
    "", disease_var_data$disease)

# add colour argument within aesthetics
p + geom_point(data = disease_var_data,
    aes(x = begin,
        y = order+0.25,
        colour = disease))

# over plotting is an problem 
# enlarge canvas and explore geom_jitter()
p <- p + ylim(0.75, 3)
p <- p + geom_jitter(data = disease_var_data,
            aes(x = begin,
                y = 2,
                colour = disease),
            width = 0, height = 0.75, 
            size = 2)


# add titles
myo7A_subtitle <- paste0("circles = Disease Variants\n",
    "source: Uniprot accessed 8 Feb 2018")

p <- p + labs(title = "Schematic of human Unconventional myosin-VIIa",
    subtitle = myo7A_subtitle)
p


SCRIPT END

Some Resources

For more help, bug reports or to suggest features



Monday, 13 November 2017

Plotting some real chromatography data

Diane Hatziioanou who's working through R for Biochemists 101 from the Biochemical Society shared some of her chromatography data with me. 
Diane has published a similar plot in the journal, Anaerobe

Here is the plot made in ggplot2:


The plot is made without the data reformatting that I did in the previous example. Each line is added separately, scaled and text added and positioned manually. The is no legend. 

The curves are smoother because there is lots more data points. 

The secondary axis on the right hand side of the plot is in response to a question from Diane and is created using this code:

p +   scale_y_continuous(sec.axis = sec_axis(~./4, 
    name="Conductivity (mS/cm)"))

The tilda "~./4" part of this code modifies the numbers on the secondary axis. In this case, it divides them by four. We also supply a name for the axis. 

The data is downloaded from Github and is supplied with Diane's permission - thanks Diane. 

Here is the script that makes it and the intermediate plots:

START
library(readxl)
library(ggplot2)

# Hatziioanou, D, Mayer MJ, Duncan, SH, Flint, HJ & Arjan Narbad, A 
# (2013) Anaerobe 23:5-8
# http://dx.doi.org/10.1016/j.anaerobe.2013.07.006

link <- "https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/20100212roseburiab.xls"
download.file(url=link, destfile="file.xls", mode="wb")

data_2 <- read_excel("file.xls", skip = 2)

colnames(data_2) <- c("vol_ml", "UV_mAU", "Cond_ml_11_1", "conduct_mS_cm",
  "Cond_11_%_ml", "percent_%", "Conc_11_ml", "percentB_1", "percentB_2",
  "Pressure_ml", "Pressure_ MPa", "Flow_ml", "Flow_ml_min", "Temp_ml", 
  "Temp_C")

# published data is column 2 vs column 7
ggplot(data_2, aes(x = Conc_11_ml, y = UV_mAU)) + geom_line()
# Warning message - missing values

# row 602 gives a volume of almost 20 mls so limit the plot to that
data_2[602,7]
# 19.993

# subset the data - just 602 rows
ggplot(data_2[1:602,], aes(x = Conc_11_ml, y = UV_mAU)) + geom_line()
# output looks good 
# very much like in the paper. 


# manual overlay and text in lots of layers to create a nice graph
# make a canvas and set the x axis
p <- ggplot(data_2[1:602,], aes(x = Conc_11_ml)) 

# add our first line with y = UV which is a measure of protein 
p <- p + geom_line(aes(y = UV_mAU)) 
p


# add our second line with %B - this is a chromatography concept
# we change the scale of the line by multiplying by 10
# define colour as red
p <- p + geom_line(aes(y = percentB_2*10), colour = "red") 
p



# add conductivity as our third line in blue
# we change the scale of the line by multiplying by 4
# define colour as blue
p <- p + geom_line(aes(y = conduct_mS_cm*4), colour = "blue")
p

# add coloured text to explain the lines
# x and y define the places the text arrive and label gives the text
p <- p +  geom_text(mapping = aes(x = 2, y = 400, label = "UV (mAU)"))
p <- p + geom_text(mapping = aes(x = 4, y = 50, label = "0% B"), colour = "red")
p <- p + geom_text(mapping = aes(x = 17.5, y = 1050, label = "100% B"), colour = "red")
p <- p + geom_text(mapping = aes(x = 7, y = 950, label = "Conductivity (mS/cm)"), colour = "blue")
p

# this gives the secondary axis on the right hand side
# the function sec_axis() with a correction factor to change the number
# in this case the correction factor is "/4" meaning divide by four
# we multiplied by four when we plotted the blue line. 
# we also give the scale a name - "Conductivity" and units. 
p <- p +   scale_y_continuous(sec.axis = sec_axis(~./4, 
    name="Conductivity (mS/cm)"))
p


# add a theme to change the background
p <- p + theme_bw()
p

# add labels, titles with a source for the data...
p <- p + labs(x = "Volume (ml)",
    y = "[Protein] (UV mAU)",
    title = "Chromatogram of FPLC purification",
    subtitle = "Source: Hatziioanou et al, (2013) Anaerobe 23:5e8")
p
END



Some Resources:




Thursday, 9 November 2017

Using drawProteins with BioMart to draw schematic of human MAP kinases...

Update (5th April, 2018): 'unigene' is not a valid attribute(s) so I have changed the code here.


Last week, I submitted my drawProteins R package to Bioconductor. One of the recommended things to do as a Bioconductor developer is to ensure that the package can work with other Bioconductor packages.

I've created this blog post to show how it's possible to use bioMart to pull out the UniProt accession numbers for the Gene Ontology (GO) term, "MAP kinase activity". This has a GO number of GO:0004707.

My script borrows heavily on the biomaRt users guide written by Steffen Durinck, Wolfgang Huber, Mike Smith. My thanks to them.

Using the R script below, I created this protein schematic:



START
# install bioMart if you haven't used it before
# remove the hash tags...
# source("http://www.bioconductor.org/biocLite.R ")
# biocLite()
# biocLite("biomaRt")
library(biomaRt)

# steps
# 1 choose a Mart with useMart()
# 2 choose a dataset with useMart()
# 3. make query with getBM()
# with
# A. filter  -  restriction on the query
# e.g. Interpro ID(s) [e.g. IPR000001]
# e.g. chromosome_name
# B. attributes  -  values we are interested in to retrieve.
# C. values - values you want to

listMarts()
# from https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html

# chosing a database = MART and a dataset - gets more focussed each step...
ensembl = useMart("ensembl",
                  dataset="hsapiens_gene_ensembl")

# Retrieve all entrezgene identifiers and HUGO gene symbols of genes which have
# a “MAP kinase activity” GO term associated with it.
# this is the GO:0004707
getBM(attributes = c('entrezgene','hgnc_symbol'),
      filters = 'go',
      values = 'GO:0004707',
      mart = ensembl)

# this is 14 proteins....

# create output in a dataframe and add uniprotswissprot
# which is the UniProt ID
output <- getBM(attributes = c('uniprotswissprot',
                               'hgnc_symbol'),
                filters = 'go',
                values = 'GO:0004707',
                mart = ensembl)


# returns a dataframe... pull out uniprotIDs for drawing...
uniprotIDs <- output$uniprotswissprot
# get rid of blank entries - turn into NA
uniprotIDs[uniprotIDs==""] <- NA
# remove NA
uniprotIDs <- na.omit(uniprotIDs)
# make the IDs characters
uniprotIDs <- as.character(uniprotIDs)
# just the unique ones
uniprotIDs <- unique(uniprotIDs)
# combine into one element
uniprotIDs <- paste(uniprotIDs, collapse = " ")
# this can now be used in drawProteins

# now get features from Uniprot
library(magrittr)

# install drawProteins from Github
# devtools::install_github("brennanpincardiff/drawProteins")
library(drawProteins)

uniprotIDs %>%
  drawProteins::get_features() %>%
  drawProteins::feature_to_dataframe() ->
  prot_data
# data frame with 722 observations

library(ggplot2)
# basic drawing
p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data)
p <- draw_domains(p, prot_data)
draw_repeat(p, prot_data)




p <- draw_motif(p, prot_data)
p <- draw_phospho(p, prot_data, size = 4)

# background and y-axis
p <- p + theme_bw(base_size = 20) +  # white background and change text size
  theme(panel.grid.minor=element_blank(),
    panel.grid.major=element_blank()) +
  theme(axis.ticks = element_blank(),
    axis.text.y = element_blank()) +
  theme(panel.border = element_blank())

# add titles
p <- p + labs(title = "Schematic of human MAP kinases",
  subtitle = "circles = phosphorylation sites\nsource:Uniprot")


# move legend to top
p <- p + theme(legend.position="top") + labs(fill="")

p
# AND it works!!

END





Resources and citations:

Monday, 6 November 2017

Making a chromatography plot with an axis on the right hand side...

While working through our online course, R for Biochemists 101 from the Biochemical Society, Diane Hatziioanou made the following comment:
Could you point me in the direction of help to make a plot with different axes? For example for a character series “fractions” make one plot with two sets of data, one plotted against the right and one against the left y axes where the two y axes could be something like absorbance and activity. Thanks!

This inspired me to create this visualisation:

If you more information, leave a comment below.

Here is the script:
START
#https://stackoverflow.com/questions/43942002/geom-line-with-different-y-axis-scale


library(magrittr)
library(tidyr)

sample_data <- data.frame(fracs = rep(1:20),
  Absorbance = c(0.1, 0.1, 0.1, 0.2, 0.4, 0.1, 0.1, 0.5, 1.0, 0.1,
    0.1, 0.1, 0.5, 0.1, 0.1, 0.2, 0.3, 0.5, 1.0, 0.1),
  Activity = c(32, 33, 35, 150, 287, 39, 35, 40, 42, 32,
    30, 31, 32, 45, 57, 32, 33, 40, 41, 40))

# reformat the data into long format to allow plotting
data <- sample_data %>% gather(y, value, Absorbance:Activity)
data$fracs <- as.integer(data$fracs)

# trial plot for Activity (enzyme activity)
ggplot(data, aes(fracs, value, colour=y)) + 
  geom_line(data=data %>% filter(y=="Activity")) 




# trial plot for Absorbance
ggplot(data, aes(fracs, value, colour=y)) + 
  geom_line(data=data %>% filter(y=="Absorbance"))



# plot the two lines together
ggplot(data, aes(fracs, value, colour=y)) + 
  geom_line(data=data %>% filter(y=="Absorbance")) + 
  geom_line(data=data %>% filter(y=="Activity"))
# but different values makes Absorbance line flat. 





# change the scale for second line
max(sample_data$Absorbance)
max(sample_data$Activity)
scaleby = max(sample_data$Activity)/max(sample_data$Absorbance)

ggplot(data, aes(colour=y)) + 
  geom_line(data=data %>% filter(y=="Activity"),
    aes(x = fracs, y = value)) + 
  geom_line(data=data %>% filter(y=="Absorbance"),
    aes(x = fracs, y=value*scaleby))
# this looks better...


# putting it together with labels etc... 
p <- ggplot() + 
  geom_line(data=data %>% filter(y=="Activity"),
    aes(x = fracs, y = value, colour=y)) + 
  geom_line(data=data %>% filter(y=="Absorbance"),
    aes(x = fracs, y=value*scaleby, colour=y))

# limit for y axis
p <- p + ylim(0,300)

# adding our second scale on the right hand side of the plot
p <- p + scale_y_continuous(sec.axis = sec_axis(~./scaleby, 
    name="Absorbance (570nM)"))
  
# label the x and y axis...
p <- p + ylab("Enzyme activity (units)")
p <- p + xlab("Chromatography fractions")


# This is salt chromatography so adding a line showing 
# the NaCl concentration. 
salt_conc <- data.frame(fracs = rep(1:20),
  conc = c(c(10), seq(from = 10, to = 250, by= 13)))

# salt curve labels
salt_label <- data.frame(x = c(1,20), 
  y = c(10, 250), 
  text = c("[10nM]\nNaCl", "[250nM]\nNaCl"))

# check out the line
ggplot(salt_conc, aes(fracs, conc)) + 
  geom_line() + 
  geom_text(data = salt_label,
    aes(x = x,
      y = y,
      label = text))



# linetype=2 gives a dashed line
p <- p + geom_line(data = salt_conc, aes(fracs, conc),
  linetype = 2)
# add some labels to show the starting and ending concentrations
p <- p + geom_text(data = salt_label,
              aes(x = x,
                y = y,
                label = text),
                size = 2.5)

p <- p + theme_bw()
p <- p + ggtitle("Chromatography plot")
p <- p + theme(legend.position="top", legend.direction="horizontal")
p <- p + theme(legend.title=element_blank())
p





END

Thursday, 2 November 2017

Using drawProteins for draw NF-kappaB proteins...

Update, 4 Jan 2018: Following advice from Bioconductor, the package magrittr is not integral to drawProteins so needs to be activated separately to use this script. 

Update,  3 Nov 2017: I've renamed the functions e.g. geom_chains is now draw_chains and added a new function draw_canvas which allows one to draw without the chain, if you want.

Building on what I did previously, more work done on the drawProteins package allows the simplier script below to draw this protein schematic:


SCRIPT START
# devtools::install_github("brennanpincardiff/drawProteins")
library(drawProteins)
library(ggplot2)
library(magrittr)

# accession numbers of five NF-kappaB proteins
"Q04206 Q01201 Q04864 P19838 Q00653" %>%
  drawProteins::get_features() %>%
  drawProteins::feature_to_dataframe() ->
  prot_data

p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data)
p <- draw_domains(p, prot_data)
p <- draw_repeat(p, prot_data)
p <- draw_motif(p, prot_data)
p <- draw_phospho(p, prot_data, size = 8)

# background and y-axis
p <- p + theme_bw(base_size = 20) +  # white background and change text size
  theme(panel.grid.minor=element_blank(),
        panel.grid.major=element_blank()) +
  theme(axis.ticks = element_blank(),
        axis.text.y = element_blank()) +
  theme(panel.border = element_blank())

# add titles
p <- p + labs(x = "Amino acid number",         # label x-axis
              y = "",  # label y-axis
              title = "Schematic of human NF-kappaB proteins",
              subtitle = "circles = phosphorylation sites\nRHD = Rel Homology Domain\nsource:Uniprot")


# move legend to top
p <- p + theme(legend.position="top") + labs(fill="")

p
SCRIPT END

Resources