R for Biochemists: 2021

Thursday, 30 September 2021

Data visualization with the programming language R - published.

Today, I published a piece in The Biochemist about data visualisation with R. I enjoyed writing the piece and it includes some of the code and blog posts that I have written here.

The code I used to create the figures has all been uploaded to Github. To reproduce the figures, you can copy and paste the code from there into a script window in RStudio.

Here is a list of pages with code for the figures.

Figure 1 - R allows reproducible data visualisations.

Figure 1A - My first published R data visualization – a cluster dendrogram from a proteomic data set. First published in 2014 and reproduced in 2021 (Alsagaby et al, 2014, https://doi.org/10.1021/pr5002803).
Figure 1B - A violin plot showing neutralizing antibody titres against five SARS-CoV-2 strains reproduced from shared data and code from Wall et al (2021, https://doi.org/10.1016/S0140-6736(21)01290-3).
Figure 1C - Phylogenetic tree of human kinase domains inspired by visualization by Manning et al (2002, https://doi.org/10.1126/science.1075762).

Figure 2 - Three most viewed data viz from this blog

Figure 2A - A volcano plot.
Figure 2B - A drug dose–response curve and LD50 calculation.
Figure 2C - A simple flow cytometry plot.

Figure 3 - Tidy Tuesday data visualisations

Figure 3A - A volcanic activity time line inspired by @ijeamaka_a
Figure 3B - Illustrating the importance of numbers in password strength. Across a range of password types, inclusion of numbers increases password strength.
Figure 3C (Github only) - Showing the proportion of female culprits in Scooby Doo shows from 1960s to 2020s.

Figure 4 - Showcasing drawProteins (below)

Figure 4A - The Hexsticker for the Bioconductor package drawProteins.
Figure 4B (Github only) - Schematic of SARS-CoV-2 S1 protein variants. Protein and variant information from Uniprot (https://www.uniprot.org/uniprot/P0DTC2).

There are lots more scripts on this blog. I hope you find them useful for learning R.

Do reach out if you need any help or if some of the code doesn't work. Either comment here or contact me through Github or Twitter: @brennanpcardiff

Friday, 23 July 2021

Seems we'll all need more COVID vaccinations...

On Wednesday, July 20th 2021, I saw this extended Twitter thread by Dr Céline Gounder that reviewed the data and concepts underpinning whether we will need more COVID vaccines. It's a super long thread that is also available here through Threadreader. The thread felt really important and showed lots of data. One of the figures that interested me most is this one from a Lancet paper:

It is a lovely violin plot that shows a decrease in virus neutralisation by variant of the current coronavirus (SARS‑CoV‑2). My understanding is that blood samples were taken from various people who had received various vaccines.
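If you have not met a violin plot before, here is a minimal sketch of the idea. It uses simulated numbers that I made up for illustration (the strain names and values are placeholders, not the study data), and it only assumes that ggplot2 is installed. Each "violin" shows the distribution of values for one group, the jittered points show the individual samples, and the log2 y-axis spreads out values that span several orders of magnitude.

```r
# A toy violin plot on a log2 axis.
# All values are simulated and are NOT the data from the Lancet paper.
library(ggplot2)

set.seed(1)
toy <- data.frame(
  strain = factor(rep(c("strainA", "strainB", "strainC"), each = 50),
                  levels = c("strainA", "strainB", "strainC")),
  ic50 = c(rlnorm(50, meanlog = log(800), sdlog = 0.8),
           rlnorm(50, meanlog = log(300), sdlog = 0.8),
           rlnorm(50, meanlog = log(100), sdlog = 0.8))
)

toy_plot <- ggplot(toy, aes(x = strain, y = ic50, colour = strain)) +
  geom_violin(trim = TRUE) +                               # distribution per group
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +      # individual samples
  stat_summary(fun = median, geom = "point",
               colour = "black", shape = 5) +              # median as a diamond
  scale_y_continuous(trans = "log2") +                     # log2 y-axis
  theme_bw() +
  theme(legend.position = "none")

toy_plot
```

Changing the `meanlog` values moves the violins up and down, which is a quick way to get a feel for how a drop in neutralisation would look on this kind of axis.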

I was surprised when I looked at the paper because there were no figures in it. However, the figures are all in the Supplementary Material, which you have to download as a PDF to read. The Supplementary Material also had a heartwarming phrase: "All data (anonymised) and full R code to produce all figures and statistical analysis presented in this manuscript are freely-available online on Github: https://github.com/davidlvb/CrickUCLH-Legacy-VOCs-2021-05"

I was very happy to see the code and data available on Github and I forked the data to see if I could reproduce this violin plot. The short answer was yes. There were some minor challenges. I'm missing the text at the top of the figure. However, I was able to reproduce the main body of the figure within about 20 minutes of engaging with the process. I was pleased.

Here is the graph I made, which is a good reproduction of the one in the Twitter feed and Figure 1B in the supplementary data. Below is some simplified code, extracted from the original code, that is required to make Figure 1B.

START

# making just Figure 1B - a violin plot
library(sp)
library(tidyverse)
library(khroma)

# download the data from my fork on Github
github_link <- "https://github.com/brennanpincardiff/Crick-UCLH-Legacy-VOCs-2021-05/blob/main/Crick_Legacy_2021-24-05_B-1-617-2_PUBLIC.Rda?raw=true"
load(url(github_link))

### from Lines 21 to 261 of original script
### Subset data for further analysis
studyData <- dtHashed %>%
  filter(COVID_vaccStatus %in% c(1,2)) # Ignore individuals who are in seroconversion window following dose 1 and dose 2 of vaccine

### Set constants for various functions / plots
strainOrder <- c("Wuhan1", "D614G", "Kent", "SAfrica", "India2")
referenceIC50 <- 2^9.802672
dose2cohort <- studyData %>% filter(COVID_vaccStatus == 2, sampleOrderInVaccStatus == 1)

########################################################################
# Panel 1. Vaccine responses per strain following 2nd dose Pfizer #
########################################################################
relevantData <- dose2cohort %>%
  pivot_longer(cols = ends_with("ic50"), names_to = "strain", values_to = "ic50")
relevantData$strain <- str_replace_all(relevantData$strain, pattern = "_ic50", replacement = "")
relevantData$strain <- fct_relevel(relevantData$strain, strainOrder)

outplot <- ggplot(relevantData, aes(x=strain, y=ic50, color = strain, label = sample_barcode)) +
  scale_colour_muted() +
  geom_hline(yintercept = referenceIC50,linetype = 3 ) +
  geom_violin(trim=TRUE) +
  scale_y_continuous(trans='log2',
                     breaks=c(5, 10, 64, 256, 1024, 5120),
                     labels=c("[0]", "[<40]", "64", "256", "1024", "[>2560]"),
                     minor_breaks=c(32, 128, 512, 2048)) +
  ylab(bquote('Virus Neutralisation, '~IC[50]~ '')) +
  geom_jitter(shape=20, position=position_jitter(0.2), alpha=0.3) +
  stat_summary(fun=median, geom = "point", color="black", shape=5, size=1, stroke=1) +
  theme_bw(base_family = "Helvetica Neue Thin") +
  theme(legend.position="none")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size=12)) +
  theme(axis.text.y = element_text(size=12)) +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.title.y=element_text(size=15),
    axis.title.x=element_blank()
  )

outplot

## END of this SCRIPT
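
The reshaping step in the middle of that script (wide columns ending in "ic50" turned into one long column, then ordered for plotting) is easier to follow on a small table. Here is a sketch using a toy tibble. The column names copy the `*_ic50` pattern from the script, but the barcodes and values are made up by me, and `toy_wide`, `toy_order` and `toy_long` are just illustrative names.

```r
# Toy version of the pivot_longer / str_replace_all / fct_relevel steps.
# The values are invented for illustration and are NOT the study data.
library(tidyverse)

toy_wide <- tibble(
  sample_barcode = c("s1", "s2"),
  Wuhan1_ic50 = c(900, 1200),
  Kent_ic50   = c(400, 650),
  India2_ic50 = c(150, 300)
)

# the order we want the strains to appear in along the x-axis
toy_order <- c("Wuhan1", "Kent", "India2")

toy_long <- toy_wide %>%
  # one row per sample per strain
  pivot_longer(cols = ends_with("ic50"), names_to = "strain", values_to = "ic50") %>%
  # strip the "_ic50" suffix, then fix the factor level order
  mutate(strain = str_replace_all(strain, pattern = "_ic50", replacement = ""),
         strain = fct_relevel(strain, toy_order))

toy_long
levels(toy_long$strain)
```

Without the `fct_relevel()` step, ggplot2 would put the strains in alphabetical order on the x-axis, which hides the trend the figure is trying to show.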

Useful resources

Making a box and whisker plot with some published proteomic data...
Original paper: "Neutralising antibody activity against SARS-CoV-2 VOCs B.1.617.2 and B.1.351 by BNT162b2 vaccination" by Wall et al https://doi.org/10.1016/S0140-6736(21)01290-3
David LV Bauer on Github: https://github.com/davidlvb
David LV Bauer lab home page
My fork of the data from the Lancet paper on Github which includes this script

Thursday, 21 January 2021

Programmatic visualization of UK SARS-CoV-2 spike protein variant

Building on my visualization of SARS-CoV-2 spike proteins, this post provides an R script that draws a schematic of the coronavirus S1 spike protein and the UK variant (B.1.1.7), which has changes within the S1 protein.

Here is the visualisation and below is the code to make it.

START

# viz the changes of the UK variant in S1 spike protein....
library(drawProteins)
library(ggplot2)
library(tidyverse)

# download protein data from
# Uniprot link: https://www.uniprot.org/uniprot/P0DTC2
drawProteins::get_features("P0DTC2") -> spike_sars
drawProteins::feature_to_dataframe(spike_sars) -> spike_data

# pull out S1 chain... begins 13 ends: 685
spike_data %>%
  filter(begin > 12 & end < 686) -> s1_bot

# duplicate this and put order = 2
s1_top <- s1_bot
s1_top$order <- 2

# combine these two
s1_both <- rbind(s1_top, s1_bot)

# draw canvas, chains & regions
draw_canvas(s1_both) -> p
p <- draw_chains(p, s1_both, labels = c("S1 protein", "B.1.1.7 variant"))
p <- draw_regions(p, s1_both)

# here are the details of the changes...
# https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563
uk_variant <- tribble(
  ~type, ~description, ~begin, ~end, ~length, ~accession, ~entryName, ~taxid, ~order,
  "B.1.1.7", "deletion", 69, 70, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
  "B.1.1.7", "deletion", 144, 144, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
  "B.1.1.7", "substitution", 501, 501, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
  "B.1.1.7", "substitution", 570, 570, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
  "B.1.1.7", "substitution", 681, 681, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
  "B.1.1.7", "substitution", 716, 716, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
  "B.1.1.7", "substitution", 982, 982, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
  "B.1.1.7", "substitution", 1118, 1118, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1
)

# overlay information about the variants
# (only changes that fall within S1, i.e. before position 686)
p <- p + geom_point(data = filter(uk_variant, begin < 686),
                    aes(x = begin,
                        y = order+0.2,
                        shape = description), size = 5)

# style the plot a bit...
p <- p + theme_bw(base_size = 14) + # white background
  theme(panel.grid.minor=element_blank(),
        panel.grid.major=element_blank()) +
  theme(axis.ticks = element_blank(),
        axis.text.y = element_blank()) +
  theme(panel.border = element_blank()) +
  theme(legend.position = "bottom")

p <- p + labs(title = "Schematic of SARS-CoV-2 S1 Protein and UK variant",
              subtitle = "Source: Uniprot (https://www.uniprot.org/uniprot/P0DTC2)")

# show the plot
p

END

I feel this could, and maybe will, be better but I'm stopping for now :-)
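
The trick in this script is to copy the S1 features, give the copy a different `order` value so it is drawn on its own row, and then keep only the variant positions that fall inside S1. Here is a sketch of just that data handling on a toy table, so it can be tried without downloading anything from Uniprot. The rows and names (`toy_features`, `toy_variant` and so on) are made up for illustration, and the table only has the few columns needed here rather than everything drawProteins returns.

```r
# Toy version of the "duplicate and re-order" step plus the variant filter.
# The feature rows are invented; only the 13-685 boundary comes from the script.
library(dplyr)

toy_features <- data.frame(
  type = c("CHAIN", "DOMAIN", "DOMAIN"),
  description = c("toy chain", "toy domain 1", "toy domain 2"),
  begin = c(13, 50, 700),
  end = c(685, 300, 900),
  order = 1
)

# keep only features that sit entirely within positions 13 to 685
toy_bot <- toy_features %>%
  filter(begin > 12 & end < 686)

# copy the rows and move the copy onto row 2 of the drawing
toy_top <- toy_bot
toy_top$order <- 2

# stack both copies into one table for plotting
toy_both <- rbind(toy_top, toy_bot)

# made-up variant positions; only those before position 686 are kept
toy_variant <- data.frame(begin = c(69, 501, 982), order = 1)
toy_variant_s1 <- filter(toy_variant, begin < 686)

toy_both
toy_variant_s1
```

Because the variant table carries its own `order` column, the points can be placed just above a chosen row by plotting at `order + 0.2`, which is what the full script does.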

Some Resources

For more help, bug reports or to suggest features

drawProteins on Github
Bioconductor forum for questions
If you use drawProteins in a publication, please cite my paper.

Saturday, 16 January 2021

Programmatic visualization of SARS-CoV-2 Spike Protein

The SARS-CoV-2 Spike protein and its variants are very important at the moment. I thought it would be interesting to showcase using my Bioconductor package drawProteins to programmatically draw a visualization of the Spike protein. This helped me understand a little more about the protein too.

The data for making the visualization is from Uniprot: https://www.uniprot.org/uniprot/P0DTC2

Here is the visualisation and below is the code to make it.

START

library(drawProteins)
library(ggplot2)
library(tidyverse)

# Uniprot link: https://www.uniprot.org/uniprot/P0DTC2
drawProteins::get_features("P0DTC2") -> spike_sars
drawProteins::feature_to_dataframe(spike_sars) -> spike_data

# From the Uniprot entry, it says that the Spike protein
# is made as a single protein and then processed into
# S1 and S2 protein.
# thus the Uniprot entry has multiple chains

# Processing Uniprot data to create different proteins
# pull out full length chain
spike_data %>%
  filter(begin < 685 & end == 1273) -> spike_data_1

# want this at the top...
spike_data_1$order = 3

# pull out S1 chain... begins 13 ends: 685
spike_data %>%
  filter(begin < 685 & end < 686) -> spike_data_2

# want this next...
spike_data_2$order = 2

# pull out S2 chain... begins 686; ends: 1273
spike_data %>%
  filter(begin > 685 & end < 1274) -> spike_data_3

# want this at the bottom
spike_data_3$order = 1

# combine all back for plotting
spike_data_o <- rbind(spike_data_1, spike_data_2, spike_data_3)

# pull out names for chains
spike_data_o %>%
  filter(type == "CHAIN") -> chains
chain_names <- c(chains$description[1:3], "")

# draw canvas and chains...
draw_canvas(spike_data_o) -> p
p <- draw_chains(p, spike_data_o,
                 labels = chain_names)

# add regions to S1 and S2
p <- draw_regions(p, spike_data_o)

p <- p + theme_bw(base_size = 14) + # white background
  theme(panel.grid.minor=element_blank(),
        panel.grid.major=element_blank()) +
  theme(axis.ticks = element_blank(),
        axis.text.y = element_blank()) +
  theme(panel.border = element_blank()) +
  theme(legend.position = "bottom")

p <- p + labs(title = "Schematic of SARS-CoV-2 Spike Protein",
              subtitle = "Source: Uniprot (https://www.uniprot.org/uniprot/P0DTC2)")

# show the plot
p

END
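
The three `filter()` calls in this script split one feature table into the full-length protein, S1 and S2, using the chain boundaries in the comments (13, 685, 686 and 1273). Here is a sketch of that split on a toy table so you can see which rows end up where. The region rows are made up for illustration, and the table has far fewer columns than the one drawProteins builds from Uniprot.

```r
# Toy version of the three-way split by position.
# Chain boundaries follow the comments in the script; region rows are invented.
library(dplyr)

toy_spike <- data.frame(
  type = c("CHAIN", "CHAIN", "CHAIN", "REGION", "REGION"),
  description = c("full length", "S1", "S2",
                  "toy region in S1", "toy region in S2"),
  begin = c(13, 13, 686, 319, 920),
  end = c(1273, 685, 1273, 541, 970)
)

# full length chain: starts in S1 and runs to the very end
full_len <- toy_spike %>% filter(begin < 685 & end == 1273)
full_len$order <- 3   # drawn at the top

# S1 and anything that lies entirely inside it
s1 <- toy_spike %>% filter(begin < 685 & end < 686)
s1$order <- 2         # drawn in the middle

# S2 and anything that lies entirely inside it
s2 <- toy_spike %>% filter(begin > 685 & end < 1274)
s2$order <- 1         # drawn at the bottom

toy_ordered <- rbind(full_len, s1, s2)
toy_ordered[, c("description", "order")]
```

One thing this makes visible is that a feature spanning the S1/S2 boundary would not match any of the three filters and so would be dropped from the drawing.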

Some Resources

For more help, bug reports or to suggest features

drawProteins on Github
Bioconductor forum for questions
If you use drawProteins in a publication, please cite my paper.
