R for Biochemists: 2020

Monday, 18 May 2020

Volcano time line by Tidy Tuesday...

We've been exploring the Tidy Tuesday datasets at our CaRdiff UseR group. It is good to look at a different types of data and the inspiration on Twitter. I tried to create a time line of volcano eruptions but it was a complete mess. Then I found this data visualisation by @ijeamaka_a which inspired me.

––

She very helpfully shared her code through Github so I was able to have a play with her code and use it to make some plots for other parts of the world and time frames.

Here is an example of a plot I have made which has eruptions in Indonesia between 1250 and 2020 with a volcano explosion index greater than 2.

Using this script it is possible to explore other places that have lots of earthquakes like Japan. This is a nice map of the volcanos in Japan.

START
## Inspiration from https://twitter.com/ijeamaka_a/status/1260660823229214724/photo/1
## https://github.com/Ijeamakaanyene/data_visualizations/blob/master/scripts/2020_10_volcanos.Rmd

library(dplyr)
library(tidyr)
library(ggplot2)

# download the data
volcano = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv')
eruptions = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/eruptions.csv')

# exploring the data a bit
## check countries with most eruptions...
erupt_by_country = left_join(eruptions, select(volcano, volcano_number, country, subregion),
by = c("volcano_number"="volcano_number")) %>%
group_by(country) %>%
count(sort = TRUE)
head(erupt_by_country)

## what size eruptions are most common...
count_erupt_size = left_join(eruptions, select(volcano, volcano_number, country, subregion),
by = c("volcano_number"="volcano_number")) %>%
group_by(vei) %>%
count(sort = TRUE)
## vei not available for lots of eruptions - historical I guess...
count_erupt_size

# create the bound data_set
all_eruptions_merged = left_join(eruptions, select(volcano, volcano_number,
country, subregion),
by = c("volcano_number"="volcano_number")) %>%
mutate(combo_year = as.numeric(paste(start_year, start_month, sep = ".")))

# allow some customization
## add your country, start year and end year
## and size of eruption from 1 to 7 or zero to see all.
my_country <- "Indonesia"
my_start_year <- 1250
my_end_year <- 2020
size <- 2

# filter merged data set...
eruptions_merged = all_eruptions_merged %>%
filter(country == my_country) %>%
filter(is.na(vei) == FALSE) %>%
filter(start_year > my_start_year) %>%
filter(start_year < my_end_year) %>%
filter(vei > size)

# Using for loop to create data needed to plot a geom_polygon
volcano_polygon_list = list()
years = unlist(eruptions_merged$combo_year)
volcano_ids = unlist(eruptions_merged$volcano_number)
veis = unlist(eruptions_merged$vei)

## There is most likely a better way of doing this.. but I cannot think of it!
## so for this to work on a wider than 20 year window we need to
## change the value for making the pyramid
## 0.25 is approx 1% of the number of decades
pyr_size <- (my_end_year - my_start_year)/100
for(i in 1:length(years)){
volcano_polygon_df = data.frame(
x = c(years[i], years[i] + pyr_size, years[i] + pyr_size*2),
y = c(0, veis[i], 0),
t = rep(volcano_ids[i], 3)
)
volcano_polygon_list[[i]] = volcano_polygon_df
}

# Converting into df and adding subregion information
volcano_polygon_df = volcano_polygon_list %>%
bind_rows() %>%
left_join(., select(eruptions_merged, volcano_number, subregion),
by = c("t" = "volcano_number"))

# create the plot
volcano_timeline = ggplot() +
geom_polygon(data = volcano_polygon_df, aes(x = x, y = y, group = t, fill = subregion),
alpha = 0.75, colour = "black") +
geom_segment(aes(y = 0, yend = 0, x = my_start_year, xend = my_end_year),
size = 1,
colour = "black",
arrow = arrow()) +
scale_x_continuous(limits = c(my_start_year, my_end_year),
expand = c(0.005, 0.005)) +
scale_y_continuous(expand = c(0, 0))


## add titles - based on the values put in...
volcano_timeline <- volcano_timeline +
labs(y = "Explosion Index", x = NULL, fill = NULL,
title = paste0("Volcanic Activity Timeline (vei>", size, ") ",
my_start_year, " to ", my_end_year, " within ", my_country),
subtitle = "Each triangle's height represents the erruptions volcanic explosion index.",
caption = paste0("Source: The Smithsonian Institution\n",
"Visualization: Paul Brennan | @brennanpcardiff adapted from Ijeamaka Anyene | @ijeamaka_a"))

## brown colour theme...
volcano_timeline + theme_bw() +
## separate out the colour scale to allow a change to this
scale_fill_manual(values = rcartocolor::carto_pal(n = 7, name = "BrwnYl"))

## Teal Green colour theme give the plot below...
volcano_timeline + theme_bw() +
## separate out the colour scale to allow a change to this
scale_fill_manual(values = rcartocolor::carto_pal(n = 7, name = "TealGrn"))

END

Resources

Thursday, 30 April 2020

Exploring Phantom ticket prices with Tidy Tuesday...

Not biological data but an interesting Tidy Tuesday data set.

It is about Broadway Shows. Here is a plot that shows that the Phantom of the Opera seems to have become relative cheaper in a more diverse market.

START
# a script for Tidy Tuesday and Tidy Thursday at CaRdiff UseR group
# 30 April 2020

library(tidyverse)
# pull in the data
grosses <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-28/grosses.csv', guess_max = 40000)

# How many unique shows...
length(unique(grosses$show))
# 1122

# inspration from here: https://github.com/teunbrand/tidytuesdayscripts/blob/master/scripts/2020_04_28_Broadway_Grosses.R
# and image from here: https://twitter.com/TeunvandenBrand/status/1255253561535074306/photo/1

# first make basic graph = average ticket prices over time...

ggplot(grosses, aes(week_ending, avg_ticket_price)) +
geom_point(alpha = 0.25)
# $500 shows depresses the axis a bit...

grosses %>%
filter(avg_ticket_price > 400) -> expensive
# this show is Springsteen on Broadway at $500!!!
# can exclude by limiting the y-axis

# filter for the Phantom
grosses %>%
filter(show == "The Phantom of the Opera") %>%
ggplot(aes(week_ending, avg_ticket_price, colour=pct_capacity)) +
geom_point() +
geom_smooth() +
labs(x = "",
y = "Average Ticket Price ($)",
title = "Phantom of the Opera",
subtitle = "Data from Playbill via Tidy Tuesday")

# why does the show seem very expensive for some weeks rather than others

grosses %>%
filter(show == "Mamma Mia!") %>%
ggplot(aes(week_ending, avg_ticket_price, colour=pct_capacity)) +
geom_point() +
geom_smooth() +
labs(x = "",
y = "Average Ticket Price ($)",
title = "Phantom of the Opera",
subtitle = "Data from Playbill via Tidy Tuesday")

# how does Phantom of the Opera compare to the others?
# make one plot first
plot <- ggplot(grosses, aes(week_ending, avg_ticket_price)) +
geom_point(alpha = 0.25, size = 0.5) + ylim(0,300)

# filter Phantom
grosses %>%
filter(show == "The Phantom of the Opera") -> p_of_op

# add to the first plot...
plot2 <- plot +
geom_point(data = p_of_op, aes(week_ending, avg_ticket_price),
colour = "red")

# show with titles
plot2 + theme_bw() +
labs(x = "",
y = "Average Ticket Price ($)",
title = "Phantom of the Opera (red): relatively cleaper in a more diverse market",
subtitle = "More expensive some weeks!\nData from Playbill via Tidy Tuesday")

END

Resources:

Official Repo for Tidy Tuesday
Tidy Tuesday hash tag on twitter for inspiration
Another Tidy Tuesday about Passwords

Thursday, 23 April 2020

Analysing an ELISA standard curve...

As part of R for Biochemists 101, the training programme created for the Biochemical Society, one of the participants, Abigail Byford (a British Heart Foundation 4-year PhD student from Leeds Institute of Cardiovascular and Metabolic Medicine) shared some of her data. She performed an ELISA for human chorionic gonadotropin (hCG) - a pregnancy hormone.

An ELISA generated colour that is read using a 96 well plate spectrophotometer that measures absorbance at a specific wavelength.

This script illustrates working with 96 well plate data, working with a standard curve and calculating unknowns.

--- START ---
library(readxl)
library(ggplot2)
library(dplyr)

# import data from Excel file no Github
# this comes straight from the spectrophotometer
link <- "https://github.com/brennanpincardiff/RforBiochemists/raw/master/data/hCG_absorbances.xlsx"

download.file(url=link, destfile="hCG_absorbances.xlsx", mode="wb")

hcg_file <- read_xlsx("hCG_absorbances.xlsx")

wavelength <- hcg_file[17,2]
days <- hcg_file[5,2]
# Excel stores the number of days since Jan-0-1900
# https://stackoverflow.com/questions/25158969/read-xlsx-reading-dates-wrong-if-non-date-in-column
# adding days to a date:
# https://stackoverflow.com/questions/10322035/r-adding-days-to-a-date
date <- as.Date("1900-01-01") + as.numeric(days) - 2
a

# this allows us a first look at the file...
# there is lots of metadata at the top which we don't need right now..

# we can skip at least the first 23 rows

abs_data <- read_xlsx("hCG_absorbances.xlsx", skip = 23)

# standard curve values are:
hcg <- c(5, 50, 200, 500, 1000, 5, 50, 200, 500, 1000)

# these correspond to A to E in columns 1 and 2
# we can pull these out with subsetting [rows, colums]
abs_data[1:5,2] # gives first set of standards

# using c() and unlist() will turn these into a vector
abs_std <- c(unlist(abs_data[1:5,2]), unlist(abs_data[1:5,3]))

# plot the data
plot(abs_std~hcg)

# create the dataframe
stand_curve <- data.frame(hcg, abs_std)

# draw the graph...
ggplot(stand_curve, aes(x = hcg, y = abs_std))+
geom_point() +
geom_smooth()
# line begins to top out at high absorbance - assay limitation

# plot and make the linear plot with just the first three values
# where hcg is less than 250
# here is the data...
stand_curve %>%
filter(hcg<250) %>%
ggplot(aes(x = hcg, y = abs_std)) +
geom_point() +
stat_smooth(method = "lm", formula = y~x) +
xlab("hCG") +
ylab(paste("Absorbance", as.character(wavelength))) +
ggtitle(paste("hCG Standard Curve \n", date))

stand_curve <- filter(stand_curve, hcg<250)
line <- lm(stand_curve$abs_std ~ stand_curve$hcg)

# extract intercept and slope from our line object
int <- line$coefficients[1]
slope <- line$coefficients[2]

# now calculate the unknowns
# R will calculate the whole data frame for us
# and put it into another data frame.
# abs_data[,2:11] - subsets the numbers we want.
# round() reduces the number of decimal points.
hCG_Ukns <- round((abs_data[,2:11] - int)/slope, 1)

hCG_Ukns
# some of the unknowns are below the lowest standard but hey...
--- END ---

Resources

Tuesday, 10 March 2020

Exploring some corona virus time courses...

Updated 12 Mar 2020 - updating names including "United Kingdom" and adding system date to plot.

It's hard to avoid and the instinct is to get stressed and apathetic in alternating cycles.
I found a interesting graph on twitter and decided to explore some of the data.

There is an online interactive dashboard from John Hopkins University and they have shared the data on Github.

So here is a script to pull this data into R and make a few graphs.

There also seems to be two R packages (one by Rami Krispin on CRAN) and one by GuangchuangYu on Github but I haven't explored them yet...

I explored the data to compare the number of COVID 19 cases in the UK with France and Italy. Here is the graph:

Note the y-axis is a log scale not a linear one.

START
# exploring corona virus data...
library(tidyverse)
library(lubridate)
library(ggthemes)
# import the data
url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv"

data <- read_csv(url)

# all the cases in the world
# plot cases in the world
data %>%
select(-`Province/State`, - Lat, -Long) %>%
pivot_longer(-`Country/Region`, names_to = "date", values_to = "cases")%>%
group_by(mdy(date)) %>%
summarise(total_cases = sum(cases)) %>%
ggplot(aes(x=`mdy(date)`, y=total_cases)) +
geom_point() +
scale_y_continuous(trans='log10') +
labs(x = "Date",
y = "Number of COVID-19 cases",
title = "COVID-19 Cases worldwide",
subtitle = "source: https://github.com/CSSEGISandData/COVID-19") +
theme_economist_white() -> p1

p1

# plot cases in China
data %>%
select(-`Province/State`, - Lat, -Long) %>%
filter(`Country/Region` == "China")%>%
pivot_longer(-`Country/Region`, names_to = "date", values_to = "cases")%>%
group_by(mdy(date)) %>%
summarise(total_cases = sum(cases)) -> data2

p2 <- p1 %+% data2
p2 + ggtitle("COVID-19 cases in China")

# plot cases in Hubei - capital is Wuhan
data %>%
select(-`Country/Region`, - Lat, -Long) %>%
filter(`Province/State` == "Hubei") %>%
pivot_longer(-`Province/State`, names_to = "date", values_to = "cases")%>%
group_by(mdy(date)) %>%
summarise(total_cases = sum(cases)) -> data3

p2 <- p1 %+% data3
p2 + ggtitle("COVID-19 cases Hubei Province")

# plot cases outside of China
data %>%
select(-`Province/State`, - Lat, -Long) %>%
# exclude Mainland China
filter(!`Country/Region` == "China") %>%
pivot_longer(-`Country/Region`, names_to = "date", values_to = "cases")%>%
group_by(mdy(date)) %>%
summarise(total_cases = sum(cases)) -> world_less_china
p_wlc <- p1 %+% world_less_china
p_wlc + ggtitle("COVID-19 in world excluding China")
# fewer cases but increasing...

# let's plot some other countries
# make a function
plot_country <- function(country){
data %>%
select(-`Province/State`, - Lat, -Long) %>%
filter(`Country/Region` == country) %>%
pivot_longer(-`Country/Region`, names_to = "date", values_to = "cases")%>%
group_by(mdy(date)) %>%
summarise(total_cases = sum(cases)) %>%
ggplot(aes(x=`mdy(date)`, y=total_cases)) +
geom_point() +
scale_y_continuous(trans='log10') +
labs(x = "Date",
y = "Number of COVID-19 cases",
title = paste("Cases in",country),
subtitle = "source: https://github.com/CSSEGISandData/COVID-19") +
theme_economist_white()

}

plot_country("Korea, South")

# something interesting happened here...

plot_country("Italy")

plot_country("Iran")

plot_country("United Kingdom")

# compare multiple countries
three_countries <- c("United Kingdom", "France", "Italy")
data %>%
select(-`Province/State`, - Lat, -Long) %>%
filter(`Country/Region` %in% three_countries) %>%
pivot_longer(-`Country/Region`, names_to = "date", values_to = "cases") %>%
filter(cases>0) %>%
ggplot(aes(x=mdy(date), y=cases, color = `Country/Region`)) +
geom_point() +
scale_y_continuous(trans='log10') +
labs(x = "Date",
y = "Number of COVID-19 cases",
title = paste("Cases in UK, France & Italy"),
subtitle = paste("source: https://github.com/CSSEGISandData/COVID-19. Updated",
Sys.Date())) +
theme_economist_white()

END

Some resources:

Analyzing 2019-nCoV (Coronavirus) data — Why Quality & Sense checks are so important in Data Science?

John Hopkins Interactive Dashboard again

Treads of links to science research papers curated by Dr Muge Cevik

More dot plots exploring the Chance of dying for males and females using the UK Life Tables

Key packages: tidyverse, dplyr, lubridate

Wednesday, 15 January 2020

Numerate biochemists can make good passwords :-)

The great thing about programming skills is that they are very transferable. As part of my learning, I joined the Cardiff R User Group which allowed me to learn from professionals from other industries. For 2020, our user group is planning to host monthly workshops playing with the Tidy Tuesday datasets.

The dataset for this month is about passwords. Because my interest is data visualisation, I have created this graph. Basically my conclusion is that nerdy passwords with numbers are a good way to go. Other and better password advice is available.

Here is the code:
# START
library(tidyverse)
passwords <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv')

passwords <- passwords[1:500,]

# frequency of passwords
# interesting plot about the frequency of passwords...
ggplot(passwords, aes(str_length(password))) + geom_histogram()

# plot rank v strength
p <- ggplot(passwords, aes(rank, strength, colour = factor(category))) +
geom_point()

p

p + facet_wrap(~category) + theme(legend.position = "none")

# pays be nerdy and numeric...

# numbers are probably the key...
# want to see if they have numbers
# colour these by number
passwords$digit <- str_detect(passwords$password, "[1234567890]")

p1 <- ggplot(passwords, aes(rank, strength, colour = digit)) +
geom_point()

p1 + facet_wrap(~category)

# look at one category - pays to be a nerd?
passwords %>%
filter(category == "nerdy-pop") %>%
ggplot(aes(rank, strength, colour = digit)) + geom_point()

# what are the better passwords
# filter on nerdy-pop, simple-alphanumeric, sport and password-related
good_words_in <- c("nerdy-pop", "simple-alphanumeric", "sport", "password-related")

passwords %>%
filter(category %in% good_words_in) %>% # the %in% is important!!
ggplot(aes(rank, strength, colour = digit)) +
geom_point() + facet_wrap(~category)

# https://stackoverflow.com/questions/15624656/label-points-in-geom-point
passwords %>%
filter(category %in% good_words_in) %>% # the %in% is important!!
ggplot(aes(rank, strength, colour = digit)) +
geom_point() +
geom_text(aes(label=ifelse(strength>20,as.character(password),''),
vjust=1)) +
facet_wrap(~category)
# works but labels are not the best :-(

# try ggrepel
# https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html
library(ggrepel)

passwords %>%
filter(category %in% good_words_in) %>% # the %in% is important!!
ggplot(aes(rank, strength, colour = digit, label = password)) +
geom_point() + geom_text_repel()
# this works but is a mess!!!!
# need to remove labels for strength < 20.

# create a data set with strength greater than 20
passwords %>%
filter(strength >20) -> better_passwords

# overlay these labels with a different dataset...
passwords %>%
filter(category %in% good_words_in) %>%
ggplot(aes(rank, strength, colour = digit)) +
geom_point() -> graphs

graphs

graphs + facet_wrap(~category)

g_label <- graphs + geom_text_repel(data = better_passwords,
aes(rank, strength, label=password))
g_label
# works nicely

g_label + facet_wrap(~category)
# works nicely

link <- "https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-14/readme.md"
# add some styling...
g_label + facet_wrap(~category) +
labs(x = "Password Rank",
y = "Password Strength",
title = "Exploring Passwords for Tidy Tuesday",
subtitle = link) +
theme_bw()

# this is my final plot, I think!!

Some resources:

A protein assay	An enzyme kinetic plot	Visual Index	Study at Cardiff

R for Biochemists

A site to help Biochemists learn R.

How to use this site...

Starting points