R for Biochemists: 2019

Tuesday 27 August 2019

Exploring cytokine data from fibroblasts with ggpubr...

EDITED 24 July 2020. Some changes to functions means that this script needed to be changed. It should work again now.

===
Last month, I wrote a blog post exploring some data from an interesting publication by Croft et al about the role of synovial fibroblasts in rheumatoid arthritis. The paper was "Distinct fibroblast subsets drive inflammation and damage in arthritis". A Biorxiv of the manuscript is available.

Today, while preparing teaching materials for R for Biochemists 201, an advanced R course I'm preparing for the Biochemical Society, I have been exploring the data in Figure 4a. The data analysis workflow requires data import, some tidying in Excel and R, summarisation and visualisation. The data in the Excel file was challenging. I have edited it to make it a little easier for me and put it on Github.

For the visualisation, I have been using the ggpubr package by Alboukadel Kassambara. ggpubr is a very interesting package that makes adding results from statistical tests to ggplots more straight forward. Alboukadel Kassambara has also created a VERY useful site for learning R.

To go back to the Nature paper by Croft et al and quote from the paper:

"FAPα+THY1− and FAPα+THY1+ cells isolated from day 9 STIA synovia were stimulated and analysed using luminex."

Below is my version of Figure 4a which shows the different cytokine profiles produced. It's not perfect but it's helped my learning a lot today.

Here is the script that makes my version of Figure 4a and some graphs along the way. The data exploration revealed a few issues about the data which I have mentioned at the bottom and along the way. I will email the corresponding and first authors with some questions. This has been a very interesting process. I would like to thank the authors for making the data available and Alboukadel Kassambara for the interesting ggpubr package.

START

# making figure 4a ...
library(readxl)
library(dplyr)
library(ggplot2)
library(tidyr)
library(tibble)
library(ggpubr)

# DATA IMPORT
# read in the data - I have simplified it a bit...
link <- ("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/figure_4a_data_altered.xlsx")
download.file(url=link, destfile="file.xlsx", mode="wb")

data_t <- read_excel("file.xlsx")

# DATA TIDY
# transforming (rotate x and y) the data makes it easier for me...
data_tt <- t(data_t[2:13])
# turn it into a tibble
data_ttt <- as_tibble(data_tt)
# add colunm names - the names for each cytokine
colnames(data_ttt) <- toupper(data_t$Cytokine)

# use gather() function from dplyr package to change to long format
data_g <- gather(data_ttt[1:12], key = "cytokine")

# need to add thy status - down the whole column
thy <- c("minus", "minus", "minus", "minus",
"minus", "minus", "plus", "plus", "plus", "plus",
"plus", "plus")
thy_144 <- c(thy, thy, thy, thy, thy, thy,
thy, thy, thy, thy, thy, thy)

data_g <- add_column(data_g, thy = thy_144)
# then make it a factor
data_g$thy <-as.factor(data_g$thy)

# order of cytokines from paper
cytokine_order <- c("CXCL14", "CXCL5", "CXCL2", "CXCL12",
"CCL2", "CCL7", "CCL5", "CCL12",
"IL6", "MMP3", "MMP9", "MMP13")
# make these a factor with an order too...
# note: error in original data on the Nature file - "Cxc12" NOT "Cxcl12"
data_g$cytokine <- factor(data_g$cytokine, levels = cytokine_order)

# DATA SUMMARISE
# calculate means and create a new file with the means
data_g %>%
group_by(cytokine, thy) %>%
summarise(mean =mean(value)) -> cyto_means

# FIRST VISUALISATION of the means
ggplot(cyto_means, aes(x = thy,
y = mean)) +
geom_col() +
facet_wrap(~cytokine, scales ="free_y")

# MORE COMPLEX VISUALISATION
# let's try adding the points and then the stat tests...
# we go back to the original data
# without the summarise step
# use stat_summary() to calculate mean and plot the bars...
# and geom_jitter()
ggplot(data_g, aes(x=thy, y=value, fill = thy)) +
stat_summary(fun="mean", geom="bar") + # calculates means
scale_fill_manual(values=c("blue", "red")) +
facet_wrap(~cytokine, scales ="free_y") +
geom_jitter()

# then the stats tests with stat_compare_means()
my_comparisons <- list( c("minus", "plus"))
ggplot(data_g, aes(x=thy, y=value, fill = thy)) +
stat_summary(fun="mean", geom="bar") +
scale_fill_manual(values=c("blue", "red")) +
facet_wrap(~cytokine, scales ="free_y") +
geom_jitter() +
stat_compare_means(comparisons = my_comparisons)
# default Wilcoxon test - non parametric
# so values are different

# change method to t.test and paired - true
ggplot(data_g, aes(x=thy, y=value, fill = thy)) +
stat_summary(fun="mean", geom="bar") +
scale_fill_manual(values=c("blue", "red")) +
facet_wrap(~cytokine, scales ="free_y") +
geom_jitter() +
stat_compare_means(comparisons = my_comparisons, method = "t.test",
paired = TRUE, size = 2)
# these don't look the same as the values in the paper
# the paper says paired T-test but perhaps I don't have the pairings correct...
# I don't think there is a way to do this without more information
# from the authors...

# can add some stars for significance but as the values aren't the same
# it won't be quite the same as the paper...
# create an object, called plot, with the graph
ggplot(data_g, aes(x=thy, y=value, fill = thy)) +
stat_summary(fun="mean", geom="bar") +
scale_fill_manual(values=c("blue", "red")) +
facet_wrap(~cytokine, scales ="free_y") +
geom_jitter() +
stat_compare_means(comparisons = my_comparisons, method = "t.test",
paired = TRUE, size = 2,
symnum.args = list(cutpoints = c(0,0.0001,0.005,0.001, 1),
symbols = c("****","***", "***", "ns"))) +
theme(strip.background = element_rect(fill = FALSE)) -> plot

# show the object
plot

# add titles and reformat them a bit
plot +
labs(x = "Thy Status",
y = "Cytokine levels",
title = "Figure 4a",
subtitle = "Croft et al, Nature, 2019") +
theme(panel.background = element_rect(fill = "white"))

# change the formatting and move legend
plot +
labs(x = "Thy Status",
y = "Cytokine levels",
title = "Figure 4a",
subtitle = "Croft et al, Nature, 2019") +
theme_classic() +
theme(strip.background = element_rect(fill = "white", colour = "white")) +
theme(legend.position="bottom")

# Not perfect but it will do for today...
END

So unlike the published version, I don't have the error bars in the plots. I can't find an easy way to do this with ggpubr. Perhaps I would need to create each graph separately. I also don't have the units described which is a weakness. They vary for the different cytokines. Again making individual graphs is probably necessary.

Playing with the data revealed some issues about the data:

Most of the T test values were as reported but some were not. For example, the published p value for CXCL12 was P = 0.0002 whereas I calculated it as 0.001.
A visual comparison of the IL6 data suggests that the published data is not the same as the data used to create the figure. The published data had three values between 200ng/ml and 400ng/ml. None of the values in the data file were this low.
There was a spelling error in the data file with CXCL12 spelled as "Cxc12"

Some resources:

Monday 1 July 2019

Exploring data from Nature journal about fibroblast subsets in arthritis...

Recently, I attended a conference entitled Data-driven Systems Medicine at Cardiff University. An interesting meeting organised by Dr Barbara Szomolay & Dr Tom Connor who work in the Systems Immunity Institute.

Prof. Mark Coles, Professor at Kennedy Institute of Rheumatology at Oxford University talked about a recent paper published in Nature. The title of the paper was "Distinct fibroblast subsets drive inflammation and damage in arthritis" by Croft et al . A Biorxiv of the manuscript is also available. Nature requires authors to make a "Data availability statement" which encourages authors to share their data.

This gave me the opportunity to explore some of the data in R. It was relatively straight forward workflow: import, tidy and visualise. To make a figure exactly like the paper takes a bit more effort.

To quote from the paper:

"Rheumatoid arthritis (RA) is a prototypic IMID6 in which synovial fibroblasts contribute to both joint damage and inflammation. We found that expression of FAPα, a cell-membrane dipeptidyl peptidase10, was significantly higher in both synovial tissue and cultured synovial fibroblasts isolated from patients who fulfilled classification criteria for RA compared to patients in whom joint inflammation resolved (Fig. 1a–c), suggesting that FAPα expression may associate with a pathogenic fibroblast phenotype."

Here is a version of Figure 1b which shows FAPalpha in patient samples with different types of disease. It a quantification of staining of cells in synovial tissue.

The code to make this is below as is the code for a version of Figure 1c and 1j.

### START of script to explore data in Figure 1b, 1c and 1j

# the packages that we need

library(readxl)

library(dplyr)

library(ggplot2)

library(tidyr)

library(ggpubr)

# It's possible to explore the data

# This is an Excel file for Figure 1 - each part a different worksheet

link <- "https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-019-1263-7/MediaObjects/41586_2019_1263_MOESM8_ESM.xlsx"

# the download.file() function downloads and saves the file with the name given

download.file(url=link, destfile="file.xlsx", mode="wb")

# check worksheet names

sheet_names <- excel_sheets("file.xlsx")

# import the data for Figure 1b

data <- read_excel("file.xlsx", sheet_names[1])

# data in wide format...

# put into long tidy format... with gather() from dplyr

data %>% gather(na.rm = TRUE) %>%

# make patient samples factors - important for graphing and stats

mutate(key = factor(key, c("Control", "Resolving", "RA"))) -> data_3

# First plot - a type of dot plot

ggplot(data_3, aes(key, value)) +

geom_point()

# geom_boxplot shows the statistics - mean and quartiles

# geom_jitter() shows points off centre across the x-axis...

plot <- ggplot(data_3, aes(key, value)) +

geom_boxplot(outlier.size=0) +

geom_jitter(width = 0.1)

plot

# we want to put a p-value on the graph

# using the stat_compare_means() function from ggpubr package

my_comparisons <- list( c("Control", "RA"))

plot + ylim(0,0.2) +

stat_compare_means(data = data_3, comparisons = my_comparisons)

# We want to replace the number with four stars as per the paper...

plot <- plot + ylim(0,0.2) +

stat_compare_means(data = data_3, comparisons = my_comparisons,

symnum.args = list(cutpoints = c(0,0.0005, 1),

symbols = c("****","ns")))

# and add labels

plot + labs(x = "",

y = "FAPalpha expression (pixels per unit area)",

title = "Figure 1b",

subtitle = "Croft et al, Nature, 2019") +

theme_classic() + theme(legend.position="none")

# not the exact same but very similar to the paper

# Data 1c is very similar in format.

# Here is a full script to show it...

data_1c <- read_excel("file.xlsx", sheet_names[2])

data_1c %>% gather(na.rm = TRUE) %>%

mutate(key = factor(key, c("Control", "Resolving", "RA"))) %>%

ggplot(aes(key, value, shape = key, colour=key)) +

geom_boxplot(outlier.size=0) +

geom_jitter(width = 0.1) +

ylim(0,2) +

labs(x = "",

y = "Fap mRNA expression",

title = "Figure 1c",

subtitle = "Croft et al, Nature, 2019") +

theme_classic() + theme(legend.position="none") +

stat_compare_means(comparisons = my_comparisons,

symnum.args = list(cutpoints = c(0,0.0005, 1),

symbols = c("****","ns")))

# Data 1j is another dot plot...

sheet_names[5]

data_1j <- read_excel("file.xlsx", sheet_names[5])

data_1j %>%

gather(na.rm = TRUE) %>%

mutate(key = factor(key, c("Control", "STIA"))) %>%

ggplot(aes(key, value, shape = key, colour=key)) +

geom_boxplot(outlier.size=0) +

geom_jitter(width = 0.1) +

ylim(0,0.25) +

labs(x = "",

y = "Bioluminescence",

title = "Figure 1j",

subtitle = "Croft et al, Nature, 2019") +

theme_classic() + theme(legend.position="none") +

stat_compare_means()

## END

Some resources:

Tuesday 16 April 2019

An icon plot inspired by bacon and a new book....

Last Saturday, "The Art of Statistics - Learning from Data" by Professor Sir David Spiegelhalter arrived. I ordered it after watching this video entitled "Why statistics should make you suspicious". It is a very interesting book which discusses statistics, data visualisation, data science and the challenges of teaching statistics.

I like to see if I can reproduce figures from papers and books, so I spent most of yesterday trying to reproduce Figure 1.4 with R. I have been partially successful. I am pretty sure I could have made it more quickly with Powerpoint, but then I wouldn't have learned anything about R. Here is the image from the book (I hope it is OK to reproduce this... I will contact to ask permission):

It's an icon plot or an icon array. They can be used to communicate risk.

Here is my attempt to reproduce the icon array using the ggimage and ggwaffle packages:

It's not perfect but it's the best I can do at the moment. I'm not happy with the separation of the rows of icons. The final images are very large and cause R-Studio some problems in terms of speed of rendering. However, I've learned a lot.

A much quicker and easier method uses the personograph package. The types of icon are very restricted - only male icons :-( Also the random distribution of black icons of the original image is not possible. This is supposed to show the random nature of disease which I quite like. The images are very quick to render and the code is easy to understand.

Here are the images made with personograph:

Figure 1.4

Bacon sandwich example using a pair of icon arrays. Of 100 people who do not eat

bacon, 6 (black icons) develop bowel cancer in the normal run of events (top panel).

Of 100 people who eat bacon every day of their lives, there is 1 additional (red) case.

I've also worked with the waffle package as shown in the code below.

Here is all the R code:

## START
# installing the packages, first remove the hash tag to run:

# install.packages("personograph") # easiest way to start

# https://github.com/liamgilbey/ggwaffle
# devtools::install_github("liamgilbey/ggwaffle")

# install.packages("waffle", "readr", "ggpubr")

# https://github.com/GuangchuangYu/ggimage
# setRepositories(ind=1:2)
# install.packages("ggimage")

# first choice of package - nice and easy but limited customisation
library(personograph)
# https://github.com/joelkuiper/personograph
# https://cran.r-project.org/web/packages/personograph/index.html

# the data is supplied in a list and is plotted in order.
data <- list(first=0.06, second=0.94)
personograph(data, colors=list(first="black", second="#efefef"),
fig.title = "100 people who do not eat bacon",
draw.legend = FALSE, dimensions=c(5,20))

data_2 <- list(first=0.06, second=0.01, third=0.93)
personograph(data_2, colors=list(first="black", second="red", third="#efefef"),
fig.title = "100 people who eat bacon every day",
draw.legend = FALSE, dimensions=c(5,20))

# icons all male... no random distribution...

# second package: waffle
library(waffle)
dont_eat_bacon <- c('Cancer' = 6, 'No cancer' = 94)
waffle(dont_eat_bacon, rows = 5, colors = c("#000000", "#efefef"),
legend_pos = "bottom", title = "100 people who do not eat bacon")

eat_bacon <- c('Cancer' = 6, 'Extra case' = 1, 'No cancer' = 93)
waffle(eat_bacon, rows = 5, colors = c("#000000", "#f90000","#efefef"),
legend_pos = "bottom", title = "100 people who eat bacon every day")

# nice clear colours but difficult to change symbols
# without installing fonts into system...
# I would rather not have to install fonts...

# Try a third package - ggwaffle

library(readr)
library(ggwaffle)
library(ggpubr)
library(ggimage)
theme_set(theme_pubr())

# in this case, we basically encode every point as a graph.
# download the data
link <- ("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/ggwaffledata_mf.csv")
bacon_waf <- read_csv(link)

# have a look at data
View(bacon_waf)

# basic plot with geom_waffle() from ggwaffle package
ggplot(bacon_waf, aes(x, y, fill = bacon)) +
geom_waffle()

# add icons with geom_icon() from ggimage package
p1 <- ggplot(bacon_waf, aes(x, y, colour = no_bacon)) +
geom_icon(aes(image=icon), size = 0.1) +
scale_color_manual(values=c("black", "grey")) +
theme_waffle() +
theme(legend.position = "none") +
labs(x = "", y = "",
title = "100 people who do not eat bacon")

# show the plot
# this is VERY SLOW to draw...
# because it contains 100 icons each of which gets
# downloaded from the internet.
# I feel sure there is a better way but I don't know it at the moment...
p1

p2 <- ggplot(bacon_waf, aes(x, y, colour = bacon)) +
geom_icon(aes(image=icon), size = 0.1) +
scale_color_manual(values=c("black","red", "grey")) +
theme_waffle() +
theme(legend.position = "none") +
labs(x = "", y = "",
title = "100 people who eat bacon every day")

# this is the text at the bottom of the page
text <- paste("Figure 1.4\n",
"Bacon sandwich example using a pair of icon arrays, with randomly",
"scattered icons showing the incremental risk of eating bacon every",
"day. Of 100 people who do not eat bacon, 6 (solid icons) develop",
"bowel cancer in the normal run of events. Of 100 people who eat",
"bacon every day of their lives, there is 1 additional (red) case.",
sep = " ")

# format the text as ggplot object
# with ggparagraph() from the ggpubr package
text_p <- ggparagraph(text = text, size = 12, color = "black")

# arrange the images and text with ggarrange() from the ggpubr package
together <- ggarrange(p1, p2, text_p,
ncol = 1, nrow = 3,
heights = c(1, 1, 0.3))

together
# AGAIN VERY SLOW...!!

# ggsave("together_3.pdf", together)
# works but these are very large PDF file and
# take a lot of time to render

# at the end it seems good to clear memory...

gc()

## END

Some resources:

Reproducing immunization data from Factfulness - another good book from Hans Rosling.

Spiegelhalter, DJ (2008) "Understanding Uncertainty" Ann Fam Med 6:196-197 doi: 10.1370/afm.848

Galesic, M & Garcia-Retamero, R (2009) "Using Icon Arrays to Communicate Medical Risks: Overcoming Low Numeracy" Health Psychology 2009, Vol. 28, No. 2, 210 –216

"Infographic-style charts using the R waffle package" by N Saunders

Friday 15 February 2019

Bar chart of common mental disorders...

My day job in the School of Medicine at Cardiff University involves facilitating learning around various medical conditions including mental health. I like a few statistics so I have been exploring the prevalence of mental health disorders. I found a report about mental health from Our World in Data which shares all the data it uses on Github - making it open source. There is lots of interesting data.
As well as mental health, there is data and reports about cancer and the burden of disease.

Inspired by the mental health report from Our World in Data, I downloaded some data and generated a graph which shows the prevalence of Mental Health Disorders in the UK.

Here is the graph:

Here is the R script that generated the graph and a few other graph along the way.

=== START ===
# looking at some mental health data...
# source: https://ourworldindata.org/mental-health

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

# download the data from Github
data <- read_csv("https://raw.githubusercontent.com/owid/owid-datasets/master/datasets/Mental%20health%20prevalence%20(IHME)/Mental%20health%20prevalence%20(IHME).csv")

# pull out data for UK and wrangle using pipes and dplyr
data %>%
# filter() by country and year
filter(Entity == "United Kingdom", Year == 2016) %>%
# select() prevalence - percentage 3rd to 13th column
select(3:13) %>%
# turn from wide format to long for better plotting using gather()
gather(key = "CMHD", value = "prevalence") -> data1

# now have new object data1

# first bar chart...
ggplot(data1, aes(x = CMHD, y = prevalence)) +
geom_bar(stat = "identity")

# plot horizontally with coord_flip()
ggplot(data1, aes(x = CMHD, y = prevalence)) +
geom_bar(stat = "identity") +
coord_flip()

# remove the text "- both sexes (percent)" gsub() function
data1$CMHD <- gsub(" \\- both sexes \$percent\$", "", data1$CMHD)
# the \\ are escape characters for minus and brackets

# AND

# reorder the categories as factors by size of prevalence
# https://www.reed.edu/data-at-reed/resources/R/reordering_geom_bar.html
data1$CMHD <- factor(data1$CMHD, levels = data1$CMHD[order(data1$prevalence)])

p <- ggplot(data1, aes(x = CMHD, y = prevalence)) +
geom_bar(stat = "identity") +
coord_flip()
p

# add some labels and source....
p <- p +
theme_bw() +
labs(x = "",
y = "Prevalence (%)",
title = "Prevalence of Common Mental Health Disorders in UK (2016)",
subtitle = "https://ourworldindata.org/mental-health")
p

# Our World in Data website has the numbers on the plot...
p <- p +
geom_text(aes(label=round(prevalence, 2)))
p

# Our World website has different coloured bars on the plot...
# by altering fill in the aes() of ggplot
p <- ggplot(data1, aes(x = CMHD, y = prevalence, fill = CMHD)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_bw() +
labs(x = "",
y = "Prevalence (%)",
title = "UK Prevalence of Common Mental Health Disorders (2016)",
subtitle = "https://ourworldindata.org/mental-health") +
geom_text(aes(label=round(prevalence, 1)))
p

# Which adds a legend... so remove the legend...
p + theme(legend.position="none")

=== END ===

Some resources:

Exploring diseases in Wales for SQL Relay...
A bar chart for Athena SWAN awards
ggplot documentation for geom_bar()
http://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization
Another source of data: https://www.mentalhealth.org.uk/statistics

A protein assay	An enzyme kinetic plot	Visual Index	Study at Cardiff

R for Biochemists

A site to help Biochemists learn R.

How to use this site...

Starting points

Tuesday 27 August 2019

Exploring cytokine data from fibroblasts with ggpubr...

Monday 1 July 2019

Exploring data from Nature journal about fibroblast subsets in arthritis...

Tuesday 16 April 2019

An icon plot inspired by bacon and a new book....

Friday 15 February 2019

Bar chart of common mental disorders...