R for Biochemists

Wednesday, 5 March 2025

Mapping the UK with Geopackage file format and the sp package...

So the GADM maps and data site has changed the format that is shares so new R packages have been created. As such some of the code delivered in the Feb 2025 version of the R for Biochemists 101 course delivered by the Biochemical Society doesn't work.

As of 5 March 2025, one of the file formats available is Geopackage which can be opened by the R package sf. This is the CRAN package page for the sf package and here is the Github page.

I've written some code to plot the UK.

## START

# script to plot United Kingdom as per the exercise in R101, Module 5

# install.packages("sf")

library(sf)

library(ggplot2)

# identify the URL

link <- "https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_GBR.gpkg"

# download the data

download.file(url=link, destfile="file.gpkg", mode="wb")

# it's a big file and takes a while to download

# check layers within the data

st_layers("file.gpkg")

# this function is from the package sf

# output

# Driver: GPKG

# Available layers:

# layer_name geometry_type features fields crs_name

# 1 ADM_ADM_0 Multi Polygon 1 2 WGS 84

# 2 ADM_ADM_1 Multi Polygon 5 11 WGS 84

# 3 ADM_ADM_2 Multi Polygon 183 13 WGS 84

# 4 ADM_ADM_3 Multi Polygon 406 16 WGS 84

# 5 ADM_ADM_4 Multi Polygon 9111 14 WGS 84

# this file contains a lot of data

# open one of the layers in the file

gb <- st_read("file.gpkg", layer = "ADM_ADM_2")

# plot the map

ggplot() + geom_sf(data = gb)

# this take a while but it does work...

# pull out the data of layer 1 with 5 features

gb_1 <- st_read("file.gpkg", layer = "ADM_ADM_1")

# plot the file

ggplot() + geom_sf(data = gb_1)

# still quite slow... gives a nice map with England, Northern Ireland,

# Scotland and Wales

## END

Thursday, 30 September 2021

Data visualization with the programming language R - published.

Today, I published a piece in The Biochemist about data visualisation with R. I enjoyed writing the piece and it includes some of the code and blog posts that I have written here.

The code I used to create the figures has all been uploaded on Github. To reproduce the figures you can cut and paste the code from there into a script window on R-Studio.

Here is a list of pages with code for the figures.

Figure 1 - R allows reproducible data visualisations.

Figure 1A - My first published R data visualization – a cluster dendrogram from a proteomic data set. First published in 2014 and reproduced in 2021 (Alsagaby et al, 2014, https://doi.org/10.1021/pr5002803).
Figure 1B - A violin plot showing neutralizing antibody titres against five SARS-CoV-2 strains reproduced from shared data and code from Wall et al (2021, https://doi.org/10.1016/S0140-6736(21)01290-3).
Figure 1C - Phylogenetic tree of human kinase domains inspired by visualization by Manning et al (2002, https://doi.org/10.1126/science.1075762).

Figure 2 - Three most viewed data viz from this blog

Figure 2A - A volcano plot.
Figure 2B - A drug dose–response curve and LD50 calculation.
Figure 2C - A simple flow cytometry plot.

Figure 3 - Tidy Tuesday data visualisations

Figure 3A - A volcanic activity time line inspired by @ijeamaka_a
Figure 3B - Illustrating the importance of numbers in password strength. Across a range of password types, inclusion of numbers increases password strength.
Figure 3C (Github only) - Showing the proportion of female culprits in Scooby Doo shows from 1960s to 2020s.

Figure 4 - Showcasing drawProteins (below)

Figure 4A - The Hexsticker for the Bioconductor package drawProteins. (b)
Figure 4B (Github only) - Schematic of SARS-CoV S1 protein variants. Protein and variant information from Uniprot (https://www.uniprot.org/uniprot/P0DTC2).

There are lots more scripts on this blog. I hope you find it useful for learning R.

Do reach out if you need any help or if some of the code doesn't work. Either comment here, contact me through Github or twitter: @brennanpcardiff

Friday, 23 July 2021

Seems we'll all need more COVID vaccinations...

On Wednesday, July 20t0 2021, I saw this extended twitter feed by Dr Céline Gounder that reviewed the data and concepts underpinning whether we will need more COVID vaccines. It's a super long thread that is available here through Threadreader too. The thread felt really important and showed lots of data. One of the figures that interested me most is this one from a Lancet paper:

It is a lovely violin plot that shows a decrease in virus neutralisation by variant of the current corona virus (SARS‑CoV‑2). My understanding is that blood samples were taken from various people who had received various vaccines.

I was surprised when I looked at the paper because there was no figures in it. However, the figures are all in the Supplementary Material which you have to download as a PDF to read. The Supplementary Material also had a heart warming phrase: "All data (anonymised) and full R code to produce all figures and statistical analysis presented in this manuscript are freely-available online on Github: https://github.com/davidlvb/CrickUCLH-Legacy-VOCs-2021-05"

I was very happy to see the code and data available on Github and I forked the data to see if I could reproduce this violin plot. The short answer was yes. There were some minor challenges. I'm missing the text is at the top of the figure. However, I was able to reproduce the main body of the figure within about 20 minutes of engaging with the process. I was pleased.

Here is the graph I made which is a good reproduction of the one in the Twitter feed and Figure 1B in the supplementary data. Below is some simplified code extracted from the original code that is required to make the Figure 1B.

START

# making just Figure 1B - a violin plot

library(sp)

library(tidyverse)

library(khroma)

# download the data from my fork on Github

github_link <- "https://github.com/brennanpincardiff/Crick-UCLH-Legacy-VOCs-2021-05/blob/main/Crick_Legacy_2021-24-05_B-1-617-2_PUBLIC.Rda?raw=true"

load(url(github_link))

### from Lines 21 to 261 of original script

### Subset data for further analysis

studyData <- dtHashed %>%

filter(COVID_vaccStatus %in% c(1,2)) # Ignore individuals who are in seroconversion window following dose 1 and dose 2 of vaccine

### Set constants for various functions / plots

strainOrder <- c("Wuhan1", "D614G", "Kent", "SAfrica", "India2")

referenceIC50 <- 2^9.802672

dose2cohort <- studyData %>% filter(COVID_vaccStatus == 2, sampleOrderInVaccStatus == 1)

########################################################################

# Panel 1. Vaccine responses per strain following 2nd dose Pfizer #

########################################################################

relevantData <- dose2cohort %>%

pivot_longer(cols = ends_with("ic50"), names_to = "strain", values_to = "ic50")

relevantData$strain <- str_replace_all(relevantData$strain, pattern = "_ic50", replacement = "")

relevantData$strain <- fct_relevel(relevantData$strain, strainOrder)

outplot <- ggplot(relevantData, aes(x=strain, y=ic50, color = strain, label = sample_barcode)) +

scale_colour_muted() +

geom_hline(yintercept = referenceIC50,linetype = 3 ) +

geom_violin(trim=TRUE) +

scale_y_continuous(trans='log2',

breaks=c(5, 10, 64, 256, 1024, 5120),

labels=c("[0]", "[<40]", "64", "256", "1024", "[>2560]"),

minor_breaks=c(32, 128, 512, 2048)) +

ylab(bquote('Virus Neutralisation, '~IC[50]~ '')) +

geom_jitter(shape=20, position=position_jitter(0.2), alpha=0.3) +

stat_summary(fun=median, geom = "point", color="black", shape=5, size=1, stroke=1) +

theme_bw(base_family = "Helvetica Neue Thin") +

theme(legend.position="none")+

theme(axis.text.x = element_text(angle = 45, hjust = 1, size=12)) +

theme(axis.text.y = element_text(size=12)) +

theme(

panel.grid.major.x = element_blank(),

panel.grid.minor.x = element_blank(),

axis.title.y=element_text(size=15),

axis.title.x=element_blank()

)

outplot

## END of this SCRIPT

Useful resources

Making a box and whisker plot with some published proteomic data...
Original paper: "Neutralising antibody activity against SARS-CoV-2 VOCs B.1.617.2 and B.1.351 by BNT162b2 vaccination" by Wall et al https://doi.org/10.1016/S0140-6736(21)01290-3
David LV Bauer on Github: https://github.com/davidlvb
David LV Bauer lab home page
My fork of the data from the Lancet paper on Github which includes this script

Thursday, 21 January 2021

Programmatic visualization of UK SARS-CoV-2 spike protein variant

Building on my visualization of SARS-CoV-2 spike proteins, this script provides a R script to allow you to draw a schematic of the corona virus S1 spike protein and the UK variant that has changes within the S1 protein.

Here is the visualisation and below is the code to make it.

START

# viz the changes of the UK variant in S1 spike protein....

library(drawProteins)

library(ggplot2)

library(tidyverse)

# download protein data from

# Uniprot link: https://www.uniprot.org/uniprot/P0DTC2

drawProteins::get_features("P0DTC2") -> spike_sars

drawProteins::feature_to_dataframe(spike_sars) -> spike_data

# pull out S1 chain... begins 13 ends: 685

spike_data %>%

filter(begin > 12 & end < 686) -> s1_bot

# duplicate this and put order = 2

s1_top <- s1_bot

s1_top$order <- 2

# combine these two

s1_both <- rbind(s1_top, s1_bot)

# draw canvas, chains & regions

draw_canvas(s1_both) -> p

p <- draw_chains(p, s1_both, labels = c("S1 protein", "B.1.1.7 variant"))

p <- draw_regions(p, s1_both)

# here are the details of the changes...

# https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563

uk_variant <- tribble(

~type, ~description, ~begin, ~end, ~length, ~accession, ~entryName, ~taxid,

~order,

"B.1.1.7", "deletion", 69, 70, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,

"B.1.1.7", "deletion", 144, 144, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,

"B.1.1.7", "substitution", 501, 501, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,

"B.1.1.7", "substitution", 570, 570, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,

"B.1.1.7", "substitution", 681, 681, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,

"B.1.1.7", "substitution", 716, 716, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,

"B.1.1.7", "substitution", 982, 982, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,

"B.1.1.7", "substitution", 1118, 1118, 1, "P0DTC2","SPIKE_SARS2", 2697049, 1,

)

# overlay information about the variants

p <- p + geom_point(data = filter(uk_variant, begin < 686),

aes(x = begin,

y = order+0.2,

shape = description), size = 5)

# style the plot a bit...

p <- p + theme_bw(base_size = 14) + # white background

theme(panel.grid.minor=element_blank(),

panel.grid.major=element_blank()) +

theme(axis.ticks = element_blank(),

axis.text.y = element_blank()) +

theme(panel.border = element_blank()) +

theme(legend.position = "bottom")

p <- p + labs(title = "Schematic of SARS-CoV-2 S1 Protein and UK variant",

subtitle = "Source: Uniprot (https://www.uniprot.org/uniprot/P0DTC2)")

END

I feel this could, and maybe will, be better but I'm stopping for now :-)

Some Resources

For more help, bug reports or to suggest features

drawProteins on Github
Bioconductor forum for questions
If you use drawProteins in a publication, please cite my paper.

Saturday, 16 January 2021

Programmatic visualization of SARS-CoV-2 Spike Protein

The SARS-CoV-2 Spike protein and its variants are very important at the moment. I thought it would be interesting to showcase using my Bioconductor package drawProteins to programmatically draw a visualization of the Spike protein. This helped me understand a little more about the protein too.

The data for making the visualization is from Uniprot: https://www.uniprot.org/uniprot/P0DTC2

Here is the visualisation and below is the code to make it.

START

library(drawProteins)

library(ggplot2)

library(tidyverse)

# Uniprot link: https://www.uniprot.org/uniprot/P0DTC2

drawProteins::get_features("P0DTC2") -> spike_sars

drawProteins::feature_to_dataframe(spike_sars) -> spike_data

# From the Uniprot entry, it say that the Spike protein

# is made as a single protein and then processed into

# S1 and S2 protein.

# thus the Uniprot entry has multiple chains

# Processing Uniprot data to create different proteins

# pull out full length chain

spike_data %>%

filter(begin < 685 & end == 1273) -> spike_data_1

# want this at the top...

spike_data_1$order = 3

# pull out S1 chain... begins 13 ends: 685

spike_data %>%

filter(begin < 685 & end < 686) -> spike_data_2

# want this next...

spike_data_2$order = 2

# pull out S2 chain... begins 686; ends: 1273

spike_data %>%

filter(begin > 685 & end < 1274) -> spike_data_3

# want this at the bottom

spike_data_3$order = 1

# combine all back for plotting

spike_data_o <- rbind(spike_data_1, spike_data_2, spike_data_3)

# pull out names for chains

spike_data_o %>%

filter(type == "CHAIN") -> chains

chain_names <- c(chains$description[1:3], "")

# draw canvas and chains...

draw_canvas(spike_data_o) -> p

p <- draw_chains(p, spike_data_o,

labels = chain_names)

# add regions to S1 and S2

p <- draw_regions(p, spike_data_o)

p <- p + theme_bw(base_size = 14) + # white background

theme(panel.grid.minor=element_blank(),

panel.grid.major=element_blank()) +

theme(axis.ticks = element_blank(),

axis.text.y = element_blank()) +

theme(panel.border = element_blank()) +

theme(legend.position = "bottom")

p <- p + labs(title = "Schematic of SARS-CoV-2 Spike Protein",

subtitle = "Source: Uniprot (https://www.uniprot.org/uniprot/P0DTC2")

END

Some Resources

For more help, bug reports or to suggest features

drawProteins on Github
Bioconductor forum for questions
If you use drawProteins in a publication, please cite my paper.

Monday, 18 May 2020

Volcano time line by Tidy Tuesday...

We've been exploring the Tidy Tuesday datasets at our CaRdiff UseR group. It is good to look at a different types of data and the inspiration on Twitter. I tried to create a time line of volcano eruptions but it was a complete mess. Then I found this data visualisation by @ijeamaka_a which inspired me.

––

She very helpfully shared her code through Github so I was able to have a play with her code and use it to make some plots for other parts of the world and time frames.

Here is an example of a plot I have made which has eruptions in Indonesia between 1250 and 2020 with a volcano explosion index greater than 2.

Using this script it is possible to explore other places that have lots of earthquakes like Japan. This is a nice map of the volcanos in Japan.

START
## Inspiration from https://twitter.com/ijeamaka_a/status/1260660823229214724/photo/1
## https://github.com/Ijeamakaanyene/data_visualizations/blob/master/scripts/2020_10_volcanos.Rmd

library(dplyr)
library(tidyr)
library(ggplot2)

# download the data
volcano = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv')
eruptions = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/eruptions.csv')

# exploring the data a bit
## check countries with most eruptions...
erupt_by_country = left_join(eruptions, select(volcano, volcano_number, country, subregion),
by = c("volcano_number"="volcano_number")) %>%
group_by(country) %>%
count(sort = TRUE)
head(erupt_by_country)

## what size eruptions are most common...
count_erupt_size = left_join(eruptions, select(volcano, volcano_number, country, subregion),
by = c("volcano_number"="volcano_number")) %>%
group_by(vei) %>%
count(sort = TRUE)
## vei not available for lots of eruptions - historical I guess...
count_erupt_size

# create the bound data_set
all_eruptions_merged = left_join(eruptions, select(volcano, volcano_number,
country, subregion),
by = c("volcano_number"="volcano_number")) %>%
mutate(combo_year = as.numeric(paste(start_year, start_month, sep = ".")))

# allow some customization
## add your country, start year and end year
## and size of eruption from 1 to 7 or zero to see all.
my_country <- "Indonesia"
my_start_year <- 1250
my_end_year <- 2020
size <- 2

# filter merged data set...
eruptions_merged = all_eruptions_merged %>%
filter(country == my_country) %>%
filter(is.na(vei) == FALSE) %>%
filter(start_year > my_start_year) %>%
filter(start_year < my_end_year) %>%
filter(vei > size)

# Using for loop to create data needed to plot a geom_polygon
volcano_polygon_list = list()
years = unlist(eruptions_merged$combo_year)
volcano_ids = unlist(eruptions_merged$volcano_number)
veis = unlist(eruptions_merged$vei)

## There is most likely a better way of doing this.. but I cannot think of it!
## so for this to work on a wider than 20 year window we need to
## change the value for making the pyramid
## 0.25 is approx 1% of the number of decades
pyr_size <- (my_end_year - my_start_year)/100
for(i in 1:length(years)){
volcano_polygon_df = data.frame(
x = c(years[i], years[i] + pyr_size, years[i] + pyr_size*2),
y = c(0, veis[i], 0),
t = rep(volcano_ids[i], 3)
)
volcano_polygon_list[[i]] = volcano_polygon_df
}

# Converting into df and adding subregion information
volcano_polygon_df = volcano_polygon_list %>%
bind_rows() %>%
left_join(., select(eruptions_merged, volcano_number, subregion),
by = c("t" = "volcano_number"))

# create the plot
volcano_timeline = ggplot() +
geom_polygon(data = volcano_polygon_df, aes(x = x, y = y, group = t, fill = subregion),
alpha = 0.75, colour = "black") +
geom_segment(aes(y = 0, yend = 0, x = my_start_year, xend = my_end_year),
size = 1,
colour = "black",
arrow = arrow()) +
scale_x_continuous(limits = c(my_start_year, my_end_year),
expand = c(0.005, 0.005)) +
scale_y_continuous(expand = c(0, 0))


## add titles - based on the values put in...
volcano_timeline <- volcano_timeline +
labs(y = "Explosion Index", x = NULL, fill = NULL,
title = paste0("Volcanic Activity Timeline (vei>", size, ") ",
my_start_year, " to ", my_end_year, " within ", my_country),
subtitle = "Each triangle's height represents the erruptions volcanic explosion index.",
caption = paste0("Source: The Smithsonian Institution\n",
"Visualization: Paul Brennan | @brennanpcardiff adapted from Ijeamaka Anyene | @ijeamaka_a"))

## brown colour theme...
volcano_timeline + theme_bw() +
## separate out the colour scale to allow a change to this
scale_fill_manual(values = rcartocolor::carto_pal(n = 7, name = "BrwnYl"))

## Teal Green colour theme give the plot below...
volcano_timeline + theme_bw() +
## separate out the colour scale to allow a change to this
scale_fill_manual(values = rcartocolor::carto_pal(n = 7, name = "TealGrn"))

END

Resources

Thursday, 30 April 2020

Exploring Phantom ticket prices with Tidy Tuesday...

Not biological data but an interesting Tidy Tuesday data set.

It is about Broadway Shows. Here is a plot that shows that the Phantom of the Opera seems to have become relative cheaper in a more diverse market.

START
# a script for Tidy Tuesday and Tidy Thursday at CaRdiff UseR group
# 30 April 2020

library(tidyverse)
# pull in the data
grosses <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-28/grosses.csv', guess_max = 40000)

# How many unique shows...
length(unique(grosses$show))
# 1122

# inspration from here: https://github.com/teunbrand/tidytuesdayscripts/blob/master/scripts/2020_04_28_Broadway_Grosses.R
# and image from here: https://twitter.com/TeunvandenBrand/status/1255253561535074306/photo/1

# first make basic graph = average ticket prices over time...

ggplot(grosses, aes(week_ending, avg_ticket_price)) +
geom_point(alpha = 0.25)
# $500 shows depresses the axis a bit...

grosses %>%
filter(avg_ticket_price > 400) -> expensive
# this show is Springsteen on Broadway at $500!!!
# can exclude by limiting the y-axis

# filter for the Phantom
grosses %>%
filter(show == "The Phantom of the Opera") %>%
ggplot(aes(week_ending, avg_ticket_price, colour=pct_capacity)) +
geom_point() +
geom_smooth() +
labs(x = "",
y = "Average Ticket Price ($)",
title = "Phantom of the Opera",
subtitle = "Data from Playbill via Tidy Tuesday")

# why does the show seem very expensive for some weeks rather than others

grosses %>%
filter(show == "Mamma Mia!") %>%
ggplot(aes(week_ending, avg_ticket_price, colour=pct_capacity)) +
geom_point() +
geom_smooth() +
labs(x = "",
y = "Average Ticket Price ($)",
title = "Phantom of the Opera",
subtitle = "Data from Playbill via Tidy Tuesday")

# how does Phantom of the Opera compare to the others?
# make one plot first
plot <- ggplot(grosses, aes(week_ending, avg_ticket_price)) +
geom_point(alpha = 0.25, size = 0.5) + ylim(0,300)

# filter Phantom
grosses %>%
filter(show == "The Phantom of the Opera") -> p_of_op

# add to the first plot...
plot2 <- plot +
geom_point(data = p_of_op, aes(week_ending, avg_ticket_price),
colour = "red")

# show with titles
plot2 + theme_bw() +
labs(x = "",
y = "Average Ticket Price ($)",
title = "Phantom of the Opera (red): relatively cleaper in a more diverse market",
subtitle = "More expensive some weeks!\nData from Playbill via Tidy Tuesday")

END

Resources:

Official Repo for Tidy Tuesday
Tidy Tuesday hash tag on twitter for inspiration
Another Tidy Tuesday about Passwords

A protein assay	An enzyme kinetic plot	Visual Index	Study at Cardiff

R for Biochemists

A site to help Biochemists learn R.

How to use this site...

Starting points

Wednesday, 5 March 2025

Mapping the UK with Geopackage file format and the sp package...

Thursday, 30 September 2021

Data visualization with the programming language R - published.

Friday, 23 July 2021

Seems we'll all need more COVID vaccinations...

Thursday, 21 January 2021

Programmatic visualization of UK SARS-CoV-2 spike protein variant

Some Resources

For more help, bug reports or to suggest features

Saturday, 16 January 2021

Programmatic visualization of SARS-CoV-2 Spike Protein

Some Resources

For more help, bug reports or to suggest features

Monday, 18 May 2020

Volcano time line by Tidy Tuesday...

Thursday, 30 April 2020

Exploring Phantom ticket prices with Tidy Tuesday...