A site to help Biochemists learn R.

Starting points

Tuesday, 15 August 2017

Drawing NFkappaB protein schematics with ggplot2..

Update (2 Nov 2017): This still works but has been simplified with drawProteins package. See this blog page for the better example.

I've been busy with work and preparing materials for our R for Biochemists 101 online training course for the Biochemical Society. The materials are being tested now and the course should be available in October 2017. Then I had a short holiday...

So, it's taken a little longer than I hoped but here is the code to draw protein schematics with ggplot2. The schematic shows protein schematics of the five human NF-kappaB proteins. Here is the visualisation:

The code uses my drawProteins package which is on Github. This can be installed using the function install_github() from the package devtools. The required syntax is devtools::install_github("brennanpincardiff/drawProteins")

I've gone through the code step by step. It's quite long and I am planning to create some more functions that will shorten the code.

Here is the step by step code that uses drawProteins and ggplot2 to create a protein schematic visualisation.  The ggplots along the way are shown too. To see the plots best requires a plot window of about 900 pixels.

# working with rel proteins as I have been studying these for 10 year
# using ggplot2 to draw the plots.
# proteins are aligned to the N-termini
# alternative alignment options would be worth considering

# devtools::install_github("brennanpincardiff/drawProteins")

# Uniprot accession numbers of five human NF-kappaB family members
proteins_acc <- c("Q04206 Q01201 Q04864 P19838 Q00653")

# accession numbers need to be combined with "%2C" for Uniprot API
proteins_acc_url <- gsub(" ", "%2C", proteins_acc)

# this is the baseurl for a multiple features query
baseurl <- "https://www.ebi.ac.uk/proteins/api/features?offset=0&size=100&accession="

# make url to GET features for multiple accession numbers
url <- paste0(baseurl, proteins_acc_url)

# basic function is GET() which accesses the API
# accept_json() argument gives a JSON object
prots_feat <- GET(url,

status_code(prots_feat)  # returns a 200 - good

# extract content() - gives a list with length = number of acc no
prots_feat_red <- httr::content(prots_feat)

# loop to work through the API object and convert to data.frame
# probably there is a better way to do this
features_total_plot <- NULL
for(i in 1:length(prots_feat_red)){
  # the extract_feat_acc() function takes features into a data.frame
  features_temp <- drawProteins::extract_feat_acc(prots_feat_red[[i]])
  features_temp$order <- i  # this order is needed for plotting later
  features_total_plot <- rbind(features_total_plot, features_temp)

# look at the data.frame
View(features_total_plot)  # 319 observations

### use ggplot2::geom_rect to draw moleucles
### in this script each type of feature is drawn separately.

## step 1: set up the canvas in ggplot (no data added)
# want to put text on the left in the area of negative numbers
plot_start <- -max(features_total_plot$end)*0.2

# need it to be longer than the longest molecule plus a bit...
plot_end <- max(features_total_plot$end) + max(features_total_plot$end)*0.1

p <- ggplot() +
  ylim(0.5, max(features_total_plot$order)+0.5) +
  xlim(plot_start, plot_end)
p # show the object

## step 2: draw the CHAINS in grey (narrower)
p <- p + geom_rect(data= features_total_plot[features_total_plot$type == "CHAIN",],
                   colour = "black",
                   fill = "grey")

## step 3: annotate with names
p <- p +  annotate("text", x = -10, y = features_total_plot$order,
                   label = features_total_plot$entryName,
                   hjust = 1,
                   size = 4)
# check the object

## step 4: plot domains fill by description
p <- p + geom_rect(data= features_total_plot[features_total_plot$type == "DOMAIN",],

# label domains with descriptions
feat_domain <- features_total_plot[features_total_plot$type == "DOMAIN", ]
p <- p + geom_label(data = feat_domain,
                    aes(x = begin + (end-begin)/2,
                        y = order,
                        label = description))
# this works well with geom_label()

## step 5 plot motifs fill by description
p <- p + geom_rect(data= features_total_plot[features_total_plot$type == "MOTIF",],

# not going to label these as they're in the description and they are small

## step 6 plot repeats fill by description
p <- p + geom_rect(data= features_total_plot[features_total_plot$type == "REPEAT",],
                   fill = "lightblue")

# label repeats (for this they are ANK but remove digits)
p <- p + geom_text(data = features_total_plot[features_total_plot$type == "REPEAT",],
                    aes(x = begin + (end-begin)/2,
                        y = order,
                        label = gsub("\\d", "", description)),
                   size =2)

# add titles
p <- p + labs(x = "Amino acid number",         # label x-axis
              y = "",  # label y-axis
              title = "Schematic of five human NF-kappaB proteins",
              subtitle = "circles = phosphorylation sites\nRHD: Rel Homology Domain\nsource:Uniprot")

# white background and remove y-axis
p <- p + theme_bw() +  # white background
         panel.grid.major=element_blank()) +
         theme(axis.ticks = element_blank(),
               axis.text.y = element_blank()) +
         theme(panel.border = element_blank())

# add phosphorlation sites
# extract info about sites using phospho_site_info() function
phospho_sites <- drawProteins::phospho_site_info(features_total_plot)

# draw phosphorylation sites on the protein with geom_point()
p <- p + geom_point(data = phospho_sites,
               aes(x = begin,
                   y = order+0.25),
               shape = 21,
               colour = "black",
               fill = "yellow", size = 4, stroke = 1)

# to see it properly you need to zoom...

# I have to admit that I am pleased with this diagram.
# The colours are better, the labels seem to work.

p + ggsave("five_human_nfkappab_proteins_20170814.pdf")

The output looks best as a PDF, I think so that's why I've saved it.
There are lots of options for altering text size and the like.
I am still working on visualising protein schematics in R so if you have feedback, it would be most welcome.


No comments:

Post a Comment

Comments and suggestions are welcome.