Tuesday 27 August 2019

Exploring cytokine data from fibroblasts with ggpubr...

EDITED 24 July 2020. Some changes to functions means that this script needed to be changed. It should work again now.

===
Last month, I wrote a blog post exploring some data from an interesting publication by Croft et al about the role of synovial fibroblasts in rheumatoid arthritis. The paper was "Distinct fibroblast subsets drive inflammation and damage in arthritis". A Biorxiv of the manuscript is available.

Today, while preparing teaching materials for R for Biochemists 201, an advanced R course I'm preparing for the Biochemical Society, I have been exploring the data in Figure 4a. The data analysis workflow requires data import, some tidying in Excel and R, summarisation and visualisation. The data in the Excel file was challenging. I have edited it to make it a little easier for me and put it on Github

For the visualisation, I have been using the ggpubr package by Alboukadel Kassambara. ggpubr is a very interesting package that makes adding results from statistical tests to ggplots more straight forward. Alboukadel Kassambara has also created a VERY useful site for learning R.

To go back to the Nature paper by Croft et al and quote from the paper:
"FAPĪ±+THY1− and FAPĪ±+THY1+ cells isolated from day 9 STIA synovia were stimulated and analysed using luminex."

Below is my version of Figure 4a which shows the different cytokine profiles produced. It's not perfect but it's helped my learning a lot today.


Here is the script that makes my version of Figure 4a and some graphs along the way. The data exploration revealed a few issues about the data which I have mentioned at the bottom and along the way. I will email the corresponding and first authors with some questions. This has been a very interesting process. I would like to thank the authors for making the data available and Alboukadel Kassambara for the interesting ggpubr package. 

START
# making figure 4a ...
library(readxl)
library(dplyr)
library(ggplot2)
library(tidyr)
library(tibble)
library(ggpubr)

# DATA IMPORT
# read in the data - I have simplified it a bit...
link <- ("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/figure_4a_data_altered.xlsx")
download.file(url=link, destfile="file.xlsx", mode="wb")

data_t <- read_excel("file.xlsx")

# DATA TIDY
# transforming (rotate x and y) the data makes it easier for me...
data_tt <- t(data_t[2:13])
# turn it into a tibble
data_ttt <- as_tibble(data_tt)
# add colunm names - the names for each cytokine
colnames(data_ttt) <- toupper(data_t$Cytokine)

# use gather() function from dplyr package to change to long format 
data_g <- gather(data_ttt[1:12], key = "cytokine")

# need to add thy status - down the whole column
thy <- c("minus", "minus",  "minus",  "minus",  
         "minus", "minus", "plus", "plus", "plus", "plus",
         "plus", "plus")
thy_144 <- c(thy, thy, thy, thy, thy, thy,
             thy, thy, thy, thy, thy, thy)

data_g <- add_column(data_g, thy = thy_144)
# then make it a factor
data_g$thy <-as.factor(data_g$thy)

# order of cytokines from paper
cytokine_order <- c("CXCL14", "CXCL5", "CXCL2", "CXCL12", 
    "CCL2", "CCL7", "CCL5", "CCL12",
    "IL6", "MMP3", "MMP9", "MMP13")
# make these a factor with an order too...
# note: error in original data on the Nature file - "Cxc12" NOT "Cxcl12"
data_g$cytokine <- factor(data_g$cytokine, levels = cytokine_order)

# DATA SUMMARISE
# calculate means and create a new file with the means
data_g %>%
    group_by(cytokine, thy) %>%
    summarise(mean =mean(value)) -> cyto_means

# FIRST VISUALISATION of the means
ggplot(cyto_means, aes(x = thy, 
    y = mean)) + 
    geom_col() + 
    facet_wrap(~cytokine, scales ="free_y")


# MORE COMPLEX VISUALISATION
# let's try adding the points and then the stat tests...
# we go back to the original data 
# without the summarise step
# use stat_summary() to calculate mean and plot the bars...
# and geom_jitter()
ggplot(data_g, aes(x=thy, y=value, fill = thy)) +
    stat_summary(fun="mean", geom="bar") +   # calculates means
    scale_fill_manual(values=c("blue", "red")) + 
    facet_wrap(~cytokine, scales ="free_y") +
    geom_jitter() 


# then the stats tests with stat_compare_means()
my_comparisons <- list( c("minus", "plus"))
ggplot(data_g, aes(x=thy, y=value, fill = thy)) +
    stat_summary(fun="mean", geom="bar") +
    scale_fill_manual(values=c("blue", "red")) + 
    facet_wrap(~cytokine, scales ="free_y") +
    geom_jitter() +
    stat_compare_means(comparisons = my_comparisons)
# default Wilcoxon test - non parametric
# so values are different


# change method to t.test and paired - true
ggplot(data_g, aes(x=thy, y=value, fill = thy)) +
    stat_summary(fun="mean", geom="bar") +
    scale_fill_manual(values=c("blue", "red")) + 
    facet_wrap(~cytokine, scales ="free_y") +
    geom_jitter() +
    stat_compare_means(comparisons = my_comparisons, method = "t.test",
        paired = TRUE, size = 2)
# these don't look the same as the values in the paper
# the paper says paired T-test but perhaps I don't have the pairings correct...
# I don't think there is a way to do this without more information
# from the authors... 


# can add some stars for significance but as the values aren't the same
# it won't be quite the same as the paper... 
# create an object, called plot, with the graph 
ggplot(data_g, aes(x=thy, y=value, fill = thy)) +
    stat_summary(fun="mean", geom="bar") +
    scale_fill_manual(values=c("blue", "red")) + 
    facet_wrap(~cytokine, scales ="free_y") +
    geom_jitter() +
    stat_compare_means(comparisons = my_comparisons, method = "t.test",
        paired = TRUE, size = 2,
        symnum.args = list(cutpoints = c(0,0.0001,0.005,0.001, 1), 
            symbols = c("****","***", "***", "ns"))) +
    theme(strip.background = element_rect(fill = FALSE)) -> plot

# show the object
plot


# add titles and reformat them a bit
plot + 
    labs(x = "Thy Status",
        y = "Cytokine levels", 
        title = "Figure 4a",
        subtitle = "Croft et al, Nature, 2019") +
    theme(panel.background = element_rect(fill = "white"))



# change the formatting and move legend
plot + 
    labs(x = "Thy Status",
        y = "Cytokine levels", 
        title = "Figure 4a",
        subtitle = "Croft et al, Nature, 2019") +
    theme_classic() +
    theme(strip.background = element_rect(fill = "white", colour = "white")) +
    theme(legend.position="bottom")


# Not perfect but it will do for today...
END

So unlike the published version, I don't have the error bars in the plots. I can't find an easy way to do this with ggpubr. Perhaps I would need to create each graph separately. I also don't have the units described which is a weakness. They vary for the different cytokines. Again making individual graphs is probably necessary.

Playing with the data revealed some issues about the data:
  1. Most of the T test values were as reported but some were not. For example, the published p value for CXCL12 was P = 0.0002 whereas I calculated it as 0.001. 
  2. A visual comparison of the IL6 data suggests that the published data is not the same as the data used to create the figure. The published data had three values between 200ng/ml and 400ng/ml. None of the values in the data file were this low. 
  3. There was a spelling error in the data file with CXCL12 spelled as "Cxc12"