R for Biochemists

Tuesday, 10 March 2020

Exploring some corona virus time courses...

Updated 12 Mar 2020 - updating names including "United Kingdom" and adding system date to plot.

It's hard to avoid and the instinct is to get stressed and apathetic in alternating cycles.
I found a interesting graph on twitter and decided to explore some of the data.

There is an online interactive dashboard from John Hopkins University and they have shared the data on Github.

So here is a script to pull this data into R and make a few graphs.

There also seems to be two R packages (one by Rami Krispin on CRAN) and one by GuangchuangYu on Github but I haven't explored them yet...

I explored the data to compare the number of COVID 19 cases in the UK with France and Italy. Here is the graph:

Note the y-axis is a log scale not a linear one.

START
# exploring corona virus data...
library(tidyverse)
library(lubridate)
library(ggthemes)
# import the data
url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv"

data <- read_csv(url)

# all the cases in the world
# plot cases in the world
data %>%
select(-`Province/State`, - Lat, -Long) %>%
pivot_longer(-`Country/Region`, names_to = "date", values_to = "cases")%>%
group_by(mdy(date)) %>%
summarise(total_cases = sum(cases)) %>%
ggplot(aes(x=`mdy(date)`, y=total_cases)) +
geom_point() +
scale_y_continuous(trans='log10') +
labs(x = "Date",
y = "Number of COVID-19 cases",
title = "COVID-19 Cases worldwide",
subtitle = "source: https://github.com/CSSEGISandData/COVID-19") +
theme_economist_white() -> p1

p1

# plot cases in China
data %>%
select(-`Province/State`, - Lat, -Long) %>%
filter(`Country/Region` == "China")%>%
pivot_longer(-`Country/Region`, names_to = "date", values_to = "cases")%>%
group_by(mdy(date)) %>%
summarise(total_cases = sum(cases)) -> data2

p2 <- p1 %+% data2
p2 + ggtitle("COVID-19 cases in China")

# plot cases in Hubei - capital is Wuhan
data %>%
select(-`Country/Region`, - Lat, -Long) %>%
filter(`Province/State` == "Hubei") %>%
pivot_longer(-`Province/State`, names_to = "date", values_to = "cases")%>%
group_by(mdy(date)) %>%
summarise(total_cases = sum(cases)) -> data3

p2 <- p1 %+% data3
p2 + ggtitle("COVID-19 cases Hubei Province")

# plot cases outside of China
data %>%
select(-`Province/State`, - Lat, -Long) %>%
# exclude Mainland China
filter(!`Country/Region` == "China") %>%
pivot_longer(-`Country/Region`, names_to = "date", values_to = "cases")%>%
group_by(mdy(date)) %>%
summarise(total_cases = sum(cases)) -> world_less_china
p_wlc <- p1 %+% world_less_china
p_wlc + ggtitle("COVID-19 in world excluding China")
# fewer cases but increasing...

# let's plot some other countries
# make a function
plot_country <- function(country){
data %>%
select(-`Province/State`, - Lat, -Long) %>%
filter(`Country/Region` == country) %>%
pivot_longer(-`Country/Region`, names_to = "date", values_to = "cases")%>%
group_by(mdy(date)) %>%
summarise(total_cases = sum(cases)) %>%
ggplot(aes(x=`mdy(date)`, y=total_cases)) +
geom_point() +
scale_y_continuous(trans='log10') +
labs(x = "Date",
y = "Number of COVID-19 cases",
title = paste("Cases in",country),
subtitle = "source: https://github.com/CSSEGISandData/COVID-19") +
theme_economist_white()

}

plot_country("Korea, South")

# something interesting happened here...

plot_country("Italy")

plot_country("Iran")

plot_country("United Kingdom")

# compare multiple countries
three_countries <- c("United Kingdom", "France", "Italy")
data %>%
select(-`Province/State`, - Lat, -Long) %>%
filter(`Country/Region` %in% three_countries) %>%
pivot_longer(-`Country/Region`, names_to = "date", values_to = "cases") %>%
filter(cases>0) %>%
ggplot(aes(x=mdy(date), y=cases, color = `Country/Region`)) +
geom_point() +
scale_y_continuous(trans='log10') +
labs(x = "Date",
y = "Number of COVID-19 cases",
title = paste("Cases in UK, France & Italy"),
subtitle = paste("source: https://github.com/CSSEGISandData/COVID-19. Updated",
Sys.Date())) +
theme_economist_white()

END

Some resources:

Analyzing 2019-nCoV (Coronavirus) data — Why Quality & Sense checks are so important in Data Science?

John Hopkins Interactive Dashboard again

Treads of links to science research papers curated by Dr Muge Cevik

More dot plots exploring the Chance of dying for males and females using the UK Life Tables

Key packages: tidyverse, dplyr, lubridate

Wednesday, 15 January 2020

Numerate biochemists can make good passwords :-)

The great thing about programming skills is that they are very transferable. As part of my learning, I joined the Cardiff R User Group which allowed me to learn from professionals from other industries. For 2020, our user group is planning to host monthly workshops playing with the Tidy Tuesday datasets.

The dataset for this month is about passwords. Because my interest is data visualisation, I have created this graph. Basically my conclusion is that nerdy passwords with numbers are a good way to go. Other and better password advice is available.

Here is the code:
# START
library(tidyverse)
passwords <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv')

passwords <- passwords[1:500,]

# frequency of passwords
# interesting plot about the frequency of passwords...
ggplot(passwords, aes(str_length(password))) + geom_histogram()

# plot rank v strength
p <- ggplot(passwords, aes(rank, strength, colour = factor(category))) +
geom_point()

p

p + facet_wrap(~category) + theme(legend.position = "none")

# pays be nerdy and numeric...

# numbers are probably the key...
# want to see if they have numbers
# colour these by number
passwords$digit <- str_detect(passwords$password, "[1234567890]")

p1 <- ggplot(passwords, aes(rank, strength, colour = digit)) +
geom_point()

p1 + facet_wrap(~category)

# look at one category - pays to be a nerd?
passwords %>%
filter(category == "nerdy-pop") %>%
ggplot(aes(rank, strength, colour = digit)) + geom_point()

# what are the better passwords
# filter on nerdy-pop, simple-alphanumeric, sport and password-related
good_words_in <- c("nerdy-pop", "simple-alphanumeric", "sport", "password-related")

passwords %>%
filter(category %in% good_words_in) %>% # the %in% is important!!
ggplot(aes(rank, strength, colour = digit)) +
geom_point() + facet_wrap(~category)

# https://stackoverflow.com/questions/15624656/label-points-in-geom-point
passwords %>%
filter(category %in% good_words_in) %>% # the %in% is important!!
ggplot(aes(rank, strength, colour = digit)) +
geom_point() +
geom_text(aes(label=ifelse(strength>20,as.character(password),''),
vjust=1)) +
facet_wrap(~category)
# works but labels are not the best :-(

# try ggrepel
# https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html
library(ggrepel)

passwords %>%
filter(category %in% good_words_in) %>% # the %in% is important!!
ggplot(aes(rank, strength, colour = digit, label = password)) +
geom_point() + geom_text_repel()
# this works but is a mess!!!!
# need to remove labels for strength < 20.

# create a data set with strength greater than 20
passwords %>%
filter(strength >20) -> better_passwords

# overlay these labels with a different dataset...
passwords %>%
filter(category %in% good_words_in) %>%
ggplot(aes(rank, strength, colour = digit)) +
geom_point() -> graphs

graphs

graphs + facet_wrap(~category)

g_label <- graphs + geom_text_repel(data = better_passwords,
aes(rank, strength, label=password))
g_label
# works nicely

g_label + facet_wrap(~category)
# works nicely

link <- "https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-14/readme.md"
# add some styling...
g_label + facet_wrap(~category) +
labs(x = "Password Rank",
y = "Password Strength",
title = "Exploring Passwords for Tidy Tuesday",
subtitle = link) +
theme_bw()

# this is my final plot, I think!!

Some resources:

Tuesday, 27 August 2019

Exploring cytokine data from fibroblasts with ggpubr...

EDITED 24 July 2020. Some changes to functions means that this script needed to be changed. It should work again now.

===
Last month, I wrote a blog post exploring some data from an interesting publication by Croft et al about the role of synovial fibroblasts in rheumatoid arthritis. The paper was "Distinct fibroblast subsets drive inflammation and damage in arthritis". A Biorxiv of the manuscript is available.

Today, while preparing teaching materials for R for Biochemists 201, an advanced R course I'm preparing for the Biochemical Society, I have been exploring the data in Figure 4a. The data analysis workflow requires data import, some tidying in Excel and R, summarisation and visualisation. The data in the Excel file was challenging. I have edited it to make it a little easier for me and put it on Github.

For the visualisation, I have been using the ggpubr package by Alboukadel Kassambara. ggpubr is a very interesting package that makes adding results from statistical tests to ggplots more straight forward. Alboukadel Kassambara has also created a VERY useful site for learning R.

To go back to the Nature paper by Croft et al and quote from the paper:

"FAPα+THY1− and FAPα+THY1+ cells isolated from day 9 STIA synovia were stimulated and analysed using luminex."

Below is my version of Figure 4a which shows the different cytokine profiles produced. It's not perfect but it's helped my learning a lot today.

Here is the script that makes my version of Figure 4a and some graphs along the way. The data exploration revealed a few issues about the data which I have mentioned at the bottom and along the way. I will email the corresponding and first authors with some questions. This has been a very interesting process. I would like to thank the authors for making the data available and Alboukadel Kassambara for the interesting ggpubr package.

START

# making figure 4a ...
library(readxl)
library(dplyr)
library(ggplot2)
library(tidyr)
library(tibble)
library(ggpubr)

# DATA IMPORT
# read in the data - I have simplified it a bit...
link <- ("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/figure_4a_data_altered.xlsx")
download.file(url=link, destfile="file.xlsx", mode="wb")

data_t <- read_excel("file.xlsx")

# DATA TIDY
# transforming (rotate x and y) the data makes it easier for me...
data_tt <- t(data_t[2:13])
# turn it into a tibble
data_ttt <- as_tibble(data_tt)
# add colunm names - the names for each cytokine
colnames(data_ttt) <- toupper(data_t$Cytokine)

# use gather() function from dplyr package to change to long format
data_g <- gather(data_ttt[1:12], key = "cytokine")

# need to add thy status - down the whole column
thy <- c("minus", "minus", "minus", "minus",
"minus", "minus", "plus", "plus", "plus", "plus",
"plus", "plus")
thy_144 <- c(thy, thy, thy, thy, thy, thy,
thy, thy, thy, thy, thy, thy)

data_g <- add_column(data_g, thy = thy_144)
# then make it a factor
data_g$thy <-as.factor(data_g$thy)

# order of cytokines from paper
cytokine_order <- c("CXCL14", "CXCL5", "CXCL2", "CXCL12",
"CCL2", "CCL7", "CCL5", "CCL12",
"IL6", "MMP3", "MMP9", "MMP13")
# make these a factor with an order too...
# note: error in original data on the Nature file - "Cxc12" NOT "Cxcl12"
data_g$cytokine <- factor(data_g$cytokine, levels = cytokine_order)

# DATA SUMMARISE
# calculate means and create a new file with the means
data_g %>%
group_by(cytokine, thy) %>%
summarise(mean =mean(value)) -> cyto_means

# FIRST VISUALISATION of the means
ggplot(cyto_means, aes(x = thy,
y = mean)) +
geom_col() +
facet_wrap(~cytokine, scales ="free_y")

# MORE COMPLEX VISUALISATION
# let's try adding the points and then the stat tests...
# we go back to the original data
# without the summarise step
# use stat_summary() to calculate mean and plot the bars...
# and geom_jitter()
ggplot(data_g, aes(x=thy, y=value, fill = thy)) +
stat_summary(fun="mean", geom="bar") + # calculates means
scale_fill_manual(values=c("blue", "red")) +
facet_wrap(~cytokine, scales ="free_y") +
geom_jitter()

# then the stats tests with stat_compare_means()
my_comparisons <- list( c("minus", "plus"))
ggplot(data_g, aes(x=thy, y=value, fill = thy)) +
stat_summary(fun="mean", geom="bar") +
scale_fill_manual(values=c("blue", "red")) +
facet_wrap(~cytokine, scales ="free_y") +
geom_jitter() +
stat_compare_means(comparisons = my_comparisons)
# default Wilcoxon test - non parametric
# so values are different

# change method to t.test and paired - true
ggplot(data_g, aes(x=thy, y=value, fill = thy)) +
stat_summary(fun="mean", geom="bar") +
scale_fill_manual(values=c("blue", "red")) +
facet_wrap(~cytokine, scales ="free_y") +
geom_jitter() +
stat_compare_means(comparisons = my_comparisons, method = "t.test",
paired = TRUE, size = 2)
# these don't look the same as the values in the paper
# the paper says paired T-test but perhaps I don't have the pairings correct...
# I don't think there is a way to do this without more information
# from the authors...

# can add some stars for significance but as the values aren't the same
# it won't be quite the same as the paper...
# create an object, called plot, with the graph
ggplot(data_g, aes(x=thy, y=value, fill = thy)) +
stat_summary(fun="mean", geom="bar") +
scale_fill_manual(values=c("blue", "red")) +
facet_wrap(~cytokine, scales ="free_y") +
geom_jitter() +
stat_compare_means(comparisons = my_comparisons, method = "t.test",
paired = TRUE, size = 2,
symnum.args = list(cutpoints = c(0,0.0001,0.005,0.001, 1),
symbols = c("****","***", "***", "ns"))) +
theme(strip.background = element_rect(fill = FALSE)) -> plot

# show the object
plot

# add titles and reformat them a bit
plot +
labs(x = "Thy Status",
y = "Cytokine levels",
title = "Figure 4a",
subtitle = "Croft et al, Nature, 2019") +
theme(panel.background = element_rect(fill = "white"))

# change the formatting and move legend
plot +
labs(x = "Thy Status",
y = "Cytokine levels",
title = "Figure 4a",
subtitle = "Croft et al, Nature, 2019") +
theme_classic() +
theme(strip.background = element_rect(fill = "white", colour = "white")) +
theme(legend.position="bottom")

# Not perfect but it will do for today...
END

So unlike the published version, I don't have the error bars in the plots. I can't find an easy way to do this with ggpubr. Perhaps I would need to create each graph separately. I also don't have the units described which is a weakness. They vary for the different cytokines. Again making individual graphs is probably necessary.

Playing with the data revealed some issues about the data:

Most of the T test values were as reported but some were not. For example, the published p value for CXCL12 was P = 0.0002 whereas I calculated it as 0.001.
A visual comparison of the IL6 data suggests that the published data is not the same as the data used to create the figure. The published data had three values between 200ng/ml and 400ng/ml. None of the values in the data file were this low.
There was a spelling error in the data file with CXCL12 spelled as "Cxc12"

Some resources:

Monday, 1 July 2019

Exploring data from Nature journal about fibroblast subsets in arthritis...

Recently, I attended a conference entitled Data-driven Systems Medicine at Cardiff University. An interesting meeting organised by Dr Barbara Szomolay & Dr Tom Connor who work in the Systems Immunity Institute.

Prof. Mark Coles, Professor at Kennedy Institute of Rheumatology at Oxford University talked about a recent paper published in Nature. The title of the paper was "Distinct fibroblast subsets drive inflammation and damage in arthritis" by Croft et al . A Biorxiv of the manuscript is also available. Nature requires authors to make a "Data availability statement" which encourages authors to share their data.

This gave me the opportunity to explore some of the data in R. It was relatively straight forward workflow: import, tidy and visualise. To make a figure exactly like the paper takes a bit more effort.

To quote from the paper:

"Rheumatoid arthritis (RA) is a prototypic IMID6 in which synovial fibroblasts contribute to both joint damage and inflammation. We found that expression of FAPα, a cell-membrane dipeptidyl peptidase10, was significantly higher in both synovial tissue and cultured synovial fibroblasts isolated from patients who fulfilled classification criteria for RA compared to patients in whom joint inflammation resolved (Fig. 1a–c), suggesting that FAPα expression may associate with a pathogenic fibroblast phenotype."

Here is a version of Figure 1b which shows FAPalpha in patient samples with different types of disease. It a quantification of staining of cells in synovial tissue.

The code to make this is below as is the code for a version of Figure 1c and 1j.

### START of script to explore data in Figure 1b, 1c and 1j

# the packages that we need

library(readxl)

library(dplyr)

library(ggplot2)

library(tidyr)

library(ggpubr)

# It's possible to explore the data

# This is an Excel file for Figure 1 - each part a different worksheet

link <- "https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-019-1263-7/MediaObjects/41586_2019_1263_MOESM8_ESM.xlsx"

# the download.file() function downloads and saves the file with the name given

download.file(url=link, destfile="file.xlsx", mode="wb")

# check worksheet names

sheet_names <- excel_sheets("file.xlsx")

# import the data for Figure 1b

data <- read_excel("file.xlsx", sheet_names[1])

# data in wide format...

# put into long tidy format... with gather() from dplyr

data %>% gather(na.rm = TRUE) %>%

# make patient samples factors - important for graphing and stats

mutate(key = factor(key, c("Control", "Resolving", "RA"))) -> data_3

# First plot - a type of dot plot

ggplot(data_3, aes(key, value)) +

geom_point()

# geom_boxplot shows the statistics - mean and quartiles

# geom_jitter() shows points off centre across the x-axis...

plot <- ggplot(data_3, aes(key, value)) +

geom_boxplot(outlier.size=0) +

geom_jitter(width = 0.1)

plot

# we want to put a p-value on the graph

# using the stat_compare_means() function from ggpubr package

my_comparisons <- list( c("Control", "RA"))

plot + ylim(0,0.2) +

stat_compare_means(data = data_3, comparisons = my_comparisons)

# We want to replace the number with four stars as per the paper...

plot <- plot + ylim(0,0.2) +

stat_compare_means(data = data_3, comparisons = my_comparisons,

symnum.args = list(cutpoints = c(0,0.0005, 1),

symbols = c("****","ns")))

# and add labels

plot + labs(x = "",

y = "FAPalpha expression (pixels per unit area)",

title = "Figure 1b",

subtitle = "Croft et al, Nature, 2019") +

theme_classic() + theme(legend.position="none")

# not the exact same but very similar to the paper

# Data 1c is very similar in format.

# Here is a full script to show it...

data_1c <- read_excel("file.xlsx", sheet_names[2])

data_1c %>% gather(na.rm = TRUE) %>%

mutate(key = factor(key, c("Control", "Resolving", "RA"))) %>%

ggplot(aes(key, value, shape = key, colour=key)) +

geom_boxplot(outlier.size=0) +

geom_jitter(width = 0.1) +

ylim(0,2) +

labs(x = "",

y = "Fap mRNA expression",

title = "Figure 1c",

subtitle = "Croft et al, Nature, 2019") +

theme_classic() + theme(legend.position="none") +

stat_compare_means(comparisons = my_comparisons,

symnum.args = list(cutpoints = c(0,0.0005, 1),

symbols = c("****","ns")))

# Data 1j is another dot plot...

sheet_names[5]

data_1j <- read_excel("file.xlsx", sheet_names[5])

data_1j %>%

gather(na.rm = TRUE) %>%

mutate(key = factor(key, c("Control", "STIA"))) %>%

ggplot(aes(key, value, shape = key, colour=key)) +

geom_boxplot(outlier.size=0) +

geom_jitter(width = 0.1) +

ylim(0,0.25) +

labs(x = "",

y = "Bioluminescence",

title = "Figure 1j",

subtitle = "Croft et al, Nature, 2019") +

theme_classic() + theme(legend.position="none") +

stat_compare_means()

## END

Some resources:

A protein assay	An enzyme kinetic plot	Visual Index	Study at Cardiff

R for Biochemists

A site to help Biochemists learn R.

How to use this site...

Starting points

Tuesday, 10 March 2020

Exploring some corona virus time courses...

Wednesday, 15 January 2020

Numerate biochemists can make good passwords :-)

Tuesday, 27 August 2019

Exploring cytokine data from fibroblasts with ggpubr...

Monday, 1 July 2019

Exploring data from Nature journal about fibroblast subsets in arthritis...