Thursday, 30 September 2021

Data visualization with the programming language R - published.

Today, I published a piece in The Biochemist about data visualisation with R. I enjoyed writing the piece and it includes some of the code and blog posts that I have written here. 

The code I used to create the figures has all been uploaded on Github. To reproduce the figures you can cut and paste the code from there into a script window on R-Studio. 

Here is a list of pages with code for the figures. 

Figure 1 - R allows reproducible data visualisations. 


Figure 2 - Three most viewed data viz from this blog

Figure 3 - Tidy Tuesday data visualisations
  • Figure 3A - A volcanic activity time line inspired by @ijeamaka_a
  • Figure 3B - Illustrating the importance of numbers in password strength. Across a range of password types, inclusion of numbers increases password strength.
  • Figure 3C (Github only) - Showing the proportion of female culprits in Scooby Doo shows from 1960s to 2020s.
Figure 4 - Showcasing drawProteins (below)



There are lots more scripts on this blog. I hope you find it useful for learning R. 


Do reach out if you need any help or if some of the code doesn't work. Either comment here, contact me through Github or twitter: @brennanpcardiff


Friday, 23 July 2021

Seems we'll all need more COVID vaccinations...

On Wednesday, July 20t0 2021, I saw this extended twitter feed by Dr CĂ©line Gounder that reviewed the data and concepts underpinning whether we will need more COVID vaccines. It's a super long thread that is available here through Threadreader too. The thread felt really important and showed lots of data. One of the figures that interested me most is this one from a Lancet paper:

It is a lovely violin plot that shows a decrease in virus neutralisation by variant of the current corona virus (SARS‑CoV‑2). My understanding is that blood samples were taken from various people who had received various vaccines. 

I was surprised when I looked at the paper because there was no figures in it. However, the figures are all in the Supplementary Material which you have to download as a PDF to read. The Supplementary Material also had a heart warming phrase: "All data (anonymised) and full R code to produce all figures and statistical analysis presented in this manuscript are freely-available online on Github: https://github.com/davidlvb/CrickUCLH-Legacy-VOCs-2021-05"

I was very happy to see the code and data available on Github and I forked the data to see if I could reproduce this violin plot. The short answer was yes. There were some minor challenges. I'm missing the text is at the top of the figure. However, I was able to reproduce the main body of the figure within about 20 minutes of engaging with the process. I was pleased.

Here is the graph I made which is a good reproduction of the one in the Twitter feed and Figure 1B in the supplementary data. Below is some simplified code extracted from the original code that is required to make the Figure 1B.

START
# making just Figure 1B - a violin plot

library(sp)
library(tidyverse)
library(khroma)

# download the data from my fork on Github
github_link <- "https://github.com/brennanpincardiff/Crick-UCLH-Legacy-VOCs-2021-05/blob/main/Crick_Legacy_2021-24-05_B-1-617-2_PUBLIC.Rda?raw=true"
load(url(github_link))

### from Lines 21 to 261 of original script

### Subset data for further analysis
studyData <- dtHashed %>% 
    filter(COVID_vaccStatus %in% c(1,2)) # Ignore individuals who are in seroconversion window following dose 1 and dose 2 of vaccine

### Set constants for various functions / plots
strainOrder <- c("Wuhan1", "D614G", "Kent", "SAfrica", "India2")
referenceIC50 <- 2^9.802672

dose2cohort <- studyData %>% filter(COVID_vaccStatus == 2,  sampleOrderInVaccStatus == 1)

########################################################################
#   Panel 1. Vaccine responses per strain following 2nd dose Pfizer    #
########################################################################

relevantData <- dose2cohort %>%
    pivot_longer(cols = ends_with("ic50"), names_to = "strain", values_to = "ic50")
relevantData$strain <- str_replace_all(relevantData$strain, pattern = "_ic50", replacement = "")
relevantData$strain <- fct_relevel(relevantData$strain, strainOrder)


outplot <- ggplot(relevantData, aes(x=strain, y=ic50, color = strain, label = sample_barcode)) + 
    scale_colour_muted() +
    geom_hline(yintercept = referenceIC50,linetype = 3 ) +
    geom_violin(trim=TRUE) + 
    scale_y_continuous(trans='log2', 
                       breaks=c(5, 10, 64, 256, 1024, 5120), 
                       labels=c("[0]", "[<40]", "64", "256", "1024", "[>2560]"),
                       minor_breaks=c(32, 128, 512, 2048)) +
    ylab(bquote('Virus Neutralisation, '~IC[50]~ '')) +
    geom_jitter(shape=20, position=position_jitter(0.2), alpha=0.3) + 
    stat_summary(fun=median, geom = "point", color="black",  shape=5, size=1, stroke=1) + 
    theme_bw(base_family = "Helvetica Neue Thin") +
    theme(legend.position="none")+
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size=12)) + 
    theme(axis.text.y = element_text(size=12))  + 
    theme(
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        axis.title.y=element_text(size=15),
        axis.title.x=element_blank()
    )

outplot
## END of this SCRIPT


Useful resources

Thursday, 21 January 2021

Programmatic visualization of UK SARS-CoV-2 spike protein variant

Building on my visualization of SARS-CoV-2 spike proteins,  this script provides a R script to allow you to draw a schematic of the corona virus S1 spike protein and the UK variant that has changes within the S1 protein. 

Here is the visualisation and below is the code to make it. 



START
# viz the changes of the UK variant in S1 spike protein....
library(drawProteins)
library(ggplot2)
library(tidyverse)

# download protein data from
# Uniprot link: https://www.uniprot.org/uniprot/P0DTC2
drawProteins::get_features("P0DTC2") -> spike_sars
drawProteins::feature_to_dataframe(spike_sars) -> spike_data

# pull out S1 chain... begins 13 ends: 685
spike_data %>%
    filter(begin > 12 & end < 686) -> s1_bot

# duplicate this and put order = 2
s1_top <- s1_bot
s1_top$order <- 2

# combine these two 
s1_both <- rbind(s1_top, s1_bot)

# draw canvas, chains & regions
draw_canvas(s1_both) -> p
p <- draw_chains(p, s1_both, labels = c("S1 protein", "B.1.1.7 variant"))
p <- draw_regions(p, s1_both)


# here are the details of the changes...

uk_variant <- tribble(
    ~type, ~description, ~begin, ~end, ~length, ~accession, ~entryName, ~taxid,
    ~order,
    "B.1.1.7", "deletion", 69, 70, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
    "B.1.1.7", "deletion", 144, 144, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
    "B.1.1.7", "substitution", 501, 501, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
    "B.1.1.7", "substitution", 570, 570, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,   
    "B.1.1.7", "substitution", 681, 681, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
    "B.1.1.7", "substitution", 716, 716, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
    "B.1.1.7", "substitution", 982, 982, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
    "B.1.1.7", "substitution", 1118, 1118, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,
)

# overlay information about the variants
p <- p + geom_point(data = filter(uk_variant, begin < 686),
                          aes(x = begin,
                              y = order+0.2, 
                              shape = description), size = 5)


# style the plot a bit...
p <- p + theme_bw(base_size = 14) + # white background
    theme(panel.grid.minor=element_blank(), 
          panel.grid.major=element_blank()) +
    theme(axis.ticks = element_blank(), 
          axis.text.y = element_blank()) +
    theme(panel.border = element_blank()) +
    theme(legend.position = "bottom")

p <- p + labs(title = "Schematic of SARS-CoV-2 S1 Protein and UK variant",
              subtitle = "Source: Uniprot (https://www.uniprot.org/uniprot/P0DTC2)")
p

END

I feel this could, and maybe will, be better but I'm stopping for now :-)

Some Resources

For more help, bug reports or to suggest features

Saturday, 16 January 2021

Programmatic visualization of SARS-CoV-2 Spike Protein

The SARS-CoV-2 Spike protein and its variants are very important at the moment. I thought it would be interesting to showcase using my Bioconductor package drawProteins to programmatically draw a visualization of the Spike protein. This helped me understand a little more about the protein too. 

The data for making the visualization is from Uniprot: https://www.uniprot.org/uniprot/P0DTC2

Here is the visualisation and below is the code to make it. 



START

library(drawProteins)

library(ggplot2)

library(tidyverse)


# Uniprot link: https://www.uniprot.org/uniprot/P0DTC2

drawProteins::get_features("P0DTC2") -> spike_sars

drawProteins::feature_to_dataframe(spike_sars) -> spike_data


# From the Uniprot entry, it say that the Spike protein

# is made as a single protein and then processed into 

# S1 and S2 protein. 

# thus the Uniprot entry has multiple chains


# Processing Uniprot data to create different proteins

# pull out full length chain

spike_data %>% 

    filter(begin < 685 & end == 1273) -> spike_data_1

# want this at the top... 

spike_data_1$order = 3


# pull out S1 chain... begins 13 ends: 685

spike_data %>%

    filter(begin < 685 & end < 686) -> spike_data_2

# want this next... 

spike_data_2$order = 2


# pull out S2 chain...  begins 686; ends: 1273

spike_data %>%

    filter(begin > 685 & end < 1274) -> spike_data_3

# want this at the bottom

spike_data_3$order = 1


# combine all back for plotting

spike_data_o <- rbind(spike_data_1, spike_data_2, spike_data_3)


# pull out names for chains

spike_data_o %>%

    filter(type == "CHAIN") -> chains

chain_names <- c(chains$description[1:3], "")


# draw canvas and chains...

draw_canvas(spike_data_o) -> p

p <- draw_chains(p, spike_data_o,

                 labels = chain_names) 


# add regions to S1 and S2

p <- draw_regions(p, spike_data_o)


p <- p + theme_bw(base_size = 14) + # white background

    theme(panel.grid.minor=element_blank(), 

          panel.grid.major=element_blank()) +

    theme(axis.ticks = element_blank(), 

          axis.text.y = element_blank()) +

    theme(panel.border = element_blank()) +

    theme(legend.position = "bottom")


p <- p + labs(title = "Schematic of SARS-CoV-2 Spike Protein",

    subtitle = "Source: Uniprot (https://www.uniprot.org/uniprot/P0DTC2")

p

END


Some Resources

For more help, bug reports or to suggest features