A site to help Biochemists learn R.

Starting points

Friday, 15 February 2019

Bar chart of common mental disorders...

My day job in the School of Medicine at Cardiff University involves facilitating learning around various medical conditions including mental health. I like a few statistics so I have been exploring the prevalence of mental health disorders. I found a report about mental health from Our World in Data which shares all the data it uses on Github - making it open source. There is lots of interesting data.
As well as mental health, there is data and reports about cancer and the burden of disease.

Inspired by the mental health report from Our World in Data, I downloaded some data and generated a graph which shows the prevalence of Mental Health Disorders in the UK.

Here is the graph:





Here is the R script that generated the graph and a few other graph along the way.

===  START ===
# looking at some mental health data...
# source: https://ourworldindata.org/mental-health

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

# download the data from Github
data <- read_csv("https://raw.githubusercontent.com/owid/owid-datasets/master/datasets/Mental%20health%20prevalence%20(IHME)/Mental%20health%20prevalence%20(IHME).csv")

# pull out data for UK and wrangle using pipes and dplyr
data %>% 
    # filter() by country and year
    filter(Entity == "United Kingdom", Year == 2016) %>%
    # select() prevalence - percentage 3rd to 13th column
    select(3:13) %>%
    # turn from wide format to long for better plotting using gather()
    gather(key = "CMHD", value = "prevalence") -> data1

# now have new object data1

# first bar chart...
ggplot(data1, aes(x = CMHD, y = prevalence)) +
    geom_bar(stat = "identity")


# plot horizontally with coord_flip()
ggplot(data1, aes(x = CMHD, y = prevalence)) +
    geom_bar(stat = "identity") +
    coord_flip()


# remove the text "- both sexes (percent)" gsub() function
data1$CMHD <- gsub(" \\- both sexes \\(percent\\)", "", data1$CMHD)
# the \\ are escape characters for minus and brackets 

# AND

# reorder the categories as factors by size of prevalence
# https://www.reed.edu/data-at-reed/resources/R/reordering_geom_bar.html
data1$CMHD <- factor(data1$CMHD, levels = data1$CMHD[order(data1$prevalence)])

p <- ggplot(data1, aes(x = CMHD, y = prevalence)) +
    geom_bar(stat = "identity") +
    coord_flip()
p


# add some labels and source....
p <- p +
    theme_bw() +
    labs(x = "",
        y = "Prevalence (%)",
        title = "Prevalence of Common Mental Health Disorders in UK (2016)", 
        subtitle = "https://ourworldindata.org/mental-health")
p


# Our World in Data website has the numbers on the plot...
p <- p +
    geom_text(aes(label=round(prevalence, 2)))
p


# Our World website has different coloured bars on the plot...
# by altering fill in the aes() of ggplot
p <- ggplot(data1, aes(x = CMHD, y = prevalence, fill = CMHD)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    theme_bw() +
    labs(x = "",
        y = "Prevalence (%)",
        title = "UK Prevalence of Common Mental Health Disorders (2016)", 
        subtitle = "https://ourworldindata.org/mental-health") +
    geom_text(aes(label=round(prevalence, 1)))
p


#  Which adds a legend... so remove the legend...
p + theme(legend.position="none")
=== END ===

Some resources:

Monday, 14 January 2019

Making a box and whisker plot with some published proteomic data...

I'm preparing some teaching materials for another Biochemical Society R training event with the draft title of R for Biochemists 201. Some more advanced material based on feedback for participants of R for Biochemists 101.
In preparation, I've been looking at published proteomics data. I've come across a nice paper by a group in the Barts Cancer Institute in London. The paper is entitled "Proteomic and genomic integration identifies kinase and differentiation determinants of kinase inhibitor sensitivity in leukemia cells". It was published in the journal Leukaemia.
I visited their lab once many years ago and I have heard the senior author, Pedro Cutillas, talk.  It is very interesting work, I think.
They have made their data available so I've spend some time writing a script that makes one part of Figure 1a - a nice box and whisker plot.

Here is the plot:



and here is the script:
START
## data import
library(readxl)
library(ggplot2)
link <- "https://static-content.springer.com/esm/art%3A10.1038%2Fs41375-018-0032-1/MediaObjects/41375_2018_32_MOESM2_ESM.xlsx"
download.file(link, "temp_data")
data <- read_excel("temp_data", skip=2) 
# skip = 2 stops the first two rows being part of the file
# the next row is used as titles of the columns
# skip needs to be determined by looking at the data

# remove bottom two rows as only 36 patients
data <- data[1:36,]

# check the structure of column 18 that we want to plot later...
str(data$`MEKi (trametinib)__1`)
# it is chr = character because of "ND" not determined in the file

# read in data again with na argument
data <- read_excel("temp_data", skip=2, na = "ND") 
data <- data[1:36,]
# the na argument means that the text "ND" will be converted to NA 
str(data$`MEKi (trametinib)__1`) # num now GOOD

# using the geom_boxplot() function, we can draw our graph
ggplot(data = data,
    aes(FAB, log10(`MEKi (trametinib)__1`), colour = FAB)) +
    geom_boxplot(na.rm = TRUE)

# we can make it look a bit more like the plot in the paper using geom_jitter()
plot <- ggplot(data = data,
    aes(FAB, log10(`MEKi (trametinib)__1`), colour = FAB)) +
    geom_boxplot(na.rm = TRUE) +
    geom_jitter(width=0.15, na.rm = TRUE) +
    theme_bw() +
    labs( y = "Log10(EC50)nM",
        title = "Sensitivity of AML patients samples to MEK inhibitor", 
        subtitle = "Casado et al (2018) Leukaemia 32:1818–1822
doi:10.1038/s41375-018-0032-1") +
    theme(legend.position="none")
plot

# to print a high resolution of this the tiff() function can be used. 
tiff("plot.tiff", height = 12, width = 17, units = 'cm', compression = "lzw", res = 300)
plot
dev.off()
END


Some resources:

Monday, 20 August 2018

Exploring more immunization data...

Last week, inspired by Factfulness, I made a graph showing the BCG immunization coverage for children at 1 year. The Factfulness graph didn't mention any specific immunization programme and World Health Organisation data monitors immunization coverage for other vaccines. For that reason, I thought it would be good to download and graph more of the available global immunization data.

This allowed me to generate this graph which shows immunization coverage across the world for nine different vaccines. The various dates that monitoring starts shows that new immunization programs are being rolled out on a regular basis - good to see.




Below is the code for downloading the data and making the graph...
If you would rather just make the graph with some cleaner data, the data from Aug 20, 2018 is available on github and can be downloaded using the read_csv() code shown about half way down the script.

## START
##  download the data  
library(tidyverse)
# install.packages("WHO")
library(WHO)

# check out the codes of the WHO data...
codes <- get_codes()
# get codes for immunizations
immun_codes <- codes[grepl("[Ii]mmuniz", codes$display), ]
immun_codes$label

# go through each of the 18 to find global data...
# Number 1 has global data

# download number 1
# requires internet access
immun_data <- as.tibble(get_data(immun_codes$label[1]))

# filter for global data
immun_data <- filter(immun_data, region == "(WHO) Global")

# repeat download for next 2 to 18 WHO codes  
# had to do this as a loop as couldn't get it to work using lapply...
# start off with second value as first is above..
# requires internet access and patience...
for(i in 2:length(immun_codes$label)){
    # download the data
    data <- as.tibble(get_data(immun_codes$label[i])) 
    
    #tell you that it has downloaded...
    print(paste("Dataset",immun_codes$label[i], "downloaded."))
    
    # filter the data for Global values
    data <- filter(data, region == "(WHO) Global")
    
    # if there is some data bind_rows()
    if(nrow(data)>1){
        # bind_rows() function from dplyr
        immun_data <- bind_rows(immun_data, data)
    }else{   
        # if not just tell us....
        print("No global data in this set")
    }
}





# reduce columns using select() function  
immun_data <- select(immun_data, gho, region, year, value )

# to avoid having to download every time... save a local copy
file_name <- paste0("global_immun_data", Sys.Date())
write_csv(immun_data, file_name)

## ----read_back if you have saved to continue from here
# immun_data <- read_csv(file_name)

# read in data from github using read_csv() function

# immun_data <- read_csv("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/global_immun_data2018-08-20")


# Let's make our plot...
plot <- ggplot(immun_data, aes(x = year, y = value, 
    colour = gho)) +
  geom_line(size = 1)+ 
  theme(legend.position="none")

plot

# separate the plots with facet wrap
plotf <- plot + facet_wrap(~gho)
plotf


## The individual graph titles are difficult to read
# Shorten them by removing text using gsub() = global substitution
immun_data$gho_s <- gsub("immunization coverage among 1-year-olds",
                            "", immun_data$gho)
immun_data$gho_s <- gsub("immunization coverage by the nationally recommended age",
                        "", immun_data$gho_s)

# make the plot again
plot <- ggplot(immun_data, aes(x = year, y = value, 
    colour = gho_s)) +
  geom_line(size = 1) + 
  theme(legend.position="none")

# separate plots with facet_wrap
plotf <- plot + facet_wrap(~gho_s)
plotf <- plotf + theme_bw() + theme(legend.position="none")
plotf

# improve plot with y limits, titles & source
source <- paste("Source: World Health Organisation, accessed:", Sys.Date())
plotf <- plotf + ylim(0,100)
plotf <- plotf + labs(x = NULL, y = "Immunization Rate",
      title = "Global immunization rates", 
    subtitle = source)
plotf



## END

Some Resources:



Friday, 10 August 2018

Exploring some immunization data... one graph from Factfulness..

I've been reading Factfulness - an important book about statistics, data visualization and generally understanding the world. Written by Hans Rosling with Ola Rosling and Anna Rosling Rönnlund of Gapminder, it uses data to try to improve our world view as many of us have an incorrect world view. It is based on the very popular videos by Hans Rosling. Here is one to have a look at...


I like to explore data and there is lots of data in the book. One of the things I like to do is to try to reproduce the graph from publications. So here is a graph from Factfulness:

It shows how the world has improved in terms of immunization. Most of us don't know that almost 90% of children worldwide are immunized.  Here is a graph for just the TB vaccine - data for BCG immunization uptake for children. It's similar to the graph above and it's made in R.



The code uses the WHO R package which allows us to access immunization data from the Global Health Observatory data repository.

Here is the code that makes the graph and some of the steps along the way.

START 

library(tidyverse)
# install.packages("WHO")
library(WHO)
# check out the codes of the WHO data...
# requires internet access
codes <- get_codes()
colnames(codes)



glimpse(codes)  # useful function from tibble package

codes[1:10, 2]

# get codes for immunizations
# search using grepl() function for finding regular expressions
codes[grepl("[Ii]mmuniz", codes$display), ]

# ----downloadBCG data
bcg_data <- get_data("WHS4_543")  # BCG data...
# requires internet access and takes a few minutes...

## ----lookatBCG data
glimpse(bcg_data)

summary(bcg_data)

## ----globalGraph
# extract Global data with the filter() function from dplyr package
# then plot...
bcg_data %>% 
    filter(region == "(WHO) Global") %>% 
    ggplot(aes(x = year, y = value)) +
    geom_line(size = 1) +
    labs(x = NULL, y = "BCG Immunization Rates")





## To make the graph look more like Factfulness...
# 1. Add a title
# 2. Set limits to 0 and 100%
# 3. Add source
# 4. Add an point at the start
# 5. Add an arrow at the end
# 6. Add values and years to start and end...
# 7. To change the style to grey under the arrow

## add titles and limits to y axis...
bcg_data %>% 
    filter(region == "(WHO) Global") %>% 
    ggplot(aes(x = year, y = value)) +
    geom_line(size = 1) +
    labs(x = NULL, y = "BCG Immunization Rates", 
        title = "BCG Immunization Rates") + 
    ylim(0,100) -> bcg_plot
bcg_plot





# add a source
source <- paste("Source: World Health Organisation \n accessed:",     Sys.Date())
bcg_plot + annotate("text",  x = 2008, y = 10, label=source, size=3)




## Add points and annotation requires more work....
# pull out global data
bcg_data %>% 
    filter(region == "(WHO) Global") -> glob_bcg
# identify minimum and maximum point and labels
min_point <- filter(glob_bcg, year == min(glob_bcg$year))
min_point_label <- paste0(min_point$value, "%\n", min_point$year)
max_point <- filter(glob_bcg, year == max(glob_bcg$year))
max_point_label <- paste0(max_point$value, "%\n", max_point$year)

# Use WHO title - better plan
graph_title <- glob_bcg$gho[1]

# put it together to make a graph
bcg_plot <- ggplot(glob_bcg, aes(x = year, y = value)) +
    geom_line(size = 1) +
    labs(x = NULL, y = "Immunization Rate",
        title = graph_title) + 
    ylim(0,100) +
    geom_point(data = min_point, aes(year, value)) +
    geom_text(data = min_point, label=min_point_label, hjust=-0.8) +
    geom_point(data = max_point, aes(year, value), 
        shape = 62, size = 5) +
    geom_text(data = max_point, label=max_point_label, vjust=1.2)

bcg_plot




## add grey underneath the line
# this uses geom_ribbon()
bcg_plot + geom_ribbon(aes(ymin=0, ymax=value), 
    fill = "#DCDCDC") # this colour is a nice grey. 



# works but overlays earlier annotation. 

## ----change order to make it all work together for final plot
bcg_plot <- ggplot(glob_bcg, aes(x = year, y = value)) +
    geom_ribbon(aes(ymin=0, ymax=value), fill = "#DCDCDC") +
    geom_line(size = 1) +
    theme_bw() +
    labs(x = NULL, y = "Immunization Rate",
        title = graph_title) + 
    ylim(0,100) +
    geom_point(data = min_point, aes(year, value), size = 5) +
    geom_text(data = min_point, label=min_point_label, hjust=-0.8) +
    geom_point(data = max_point, aes(year, value), 
        shape = 62, size = 8) +
    geom_text(data = max_point, label=max_point_label, vjust=1.2) +
    annotate("text",  x = 2008, y = 10, label=source, size=3)

bcg_plot



END


Some Resources:

Friday, 22 June 2018

Creating a drawProteins hex sticker...

Lots of Bioconductor packages have created hex stickers - those fun stickers that go on the top of your laptop. They are a bit nerdy but, hey, who doesn't like nerdy 😀

I thought I should create a hex sticker for drawProteins since it's now available in Bioconductor.

It's been created totally in R using the hexSticker package from the talented GuangchuangYu.

Here is the sticker:


Below is the R code that makes the sticker. I've chosen to show three of NF-kappaB proteins but the code could be adapted to illustrate your favourite protein. You can also customise the sticker colour and lots of other things.

START
# script to create drawProteins sticker
# setwd("/Users/paulbrennan/Documents/drawProteins_hexsticker")
library(ggplot2)
# install.packages("hexSticker")
library(hexSticker)
library(drawProteins)

data("five_rel_data")
p <- draw_canvas(five_rel_data)
p <- draw_chains(p, five_rel_data, label_chains = FALSE, size =0.2)
p <- draw_domains(p, five_rel_data,
  show.legend = FALSE,
  label_size = 2)
p

p <- draw_phospho(p, five_rel_data)
p <- p  + theme_void()

sticker(p, package="drawProteins", p_size=6, s_x=.98,
  s_y=.88, s_width=1.4, s_height=1.1,h_fill = "#114466", h_color="#001030")


# this doesn't quite work :-(
# size of text is wrong... no labels... cramped!
# I think five proteins is probably too many...

# reduce to three....
# P19838
# Q04206
# Q04864

# get from UniProt rather than using internal data
prot_data <- drawProteins::get_features("Q04206 Q04864 P19838")

# turn the data into a dataframe
prot_data <- drawProteins::feature_to_dataframe(prot_data)

# create canvas
p <- draw_canvas(prot_data)

# customise the labels
p <- draw_chains(p, prot_data, label_size = 1.4,
  labels = c("c-Rel",
    "p50/p105",
    "p50/p105",
    "p65/Rel A"),
  size=0.2)

# draw domains without a legend or labels
p <- draw_domains(p, prot_data, show.legend = FALSE, label_domains = FALSE)

# add text and control size with geom_text rather than labels from drawProteins
p <- p + ggplot2::geom_text(data = prot_data[prot_data$type == "DOMAIN", ],
  ggplot2::aes(x = begin + (end-begin)/2,
               y = order,
               label = description),
           size = 1.3)

# keep phosphorylation sites small
p <- draw_phospho(p, prot_data, size = 1)

# a completely blank theme using theme_void()
p <- p  + theme_void()
p

# make the sticker using the ggplot object entitled p
# creates a png file
sticker(p,
  package="drawProteins",
  p_size=6,  # font size for package name
  s_x=.95,   # x position for ggplot object
  s_y=.86,   # y position for ggplot object
  s_width=1.4,
  s_height=1.1,
  h_fill = "#e8a935",   # background colour for hexagon
  h_color="#001030",    # border colour
  url = "www.bioconductor.org")
# automatically creates the PNG file with the name drawProteins.


# make the PDF
sticker(p,
  package="drawProteins",
  p_size=6,  # font size for package name
  s_x=.95,   # x position for ggplot object
  s_y=.86,   # y position for ggplot object
  s_width=1.4,
  s_height=1.1,
  h_fill = "#e8a935",   # background colour for hexagon
  h_color="#001030",    # border colour
  url = "www.bioconductor.org",
  filename="drawProteins.pdf")

END


Some Resources:

For more help, bug reports or to suggest features


Tuesday, 13 February 2018

Plotting disease causing myosin variants using drawProteins

My drawProteins package has been accepted by Bioconductor. It's currently available as a development version -  a nice conclusion of lots of effort.

Last week, I had my first query about drawProteins from Kevin T. Booth, a graduate student at the University of Iowa. Kevin wanted to plot disease causing protein variants.

Kevin's working on the MYO7A gene (Uniprot ID “Q13402”). Kevin told me that "alterations in MYO7A can result in 3 different phenotypes-Autosomal Dominant non-syndromic hearing loss (DFNA11),  Autosomal Recessive non-syndromic hearing loss (DFNB2) or Usher Syndrome Type 1 (USH1B), another type of inherited deafness.

Looking at the Uniprot page for MYO7A gene, it contains a long list of disease associated variants. We created a visualisation of these variants using the geom_point() function to apply a ggplot2 layer to a schematic of the protein.




Here is the script that generated this and a few other images on the way. Feedback welcome.

SCRIPT START
# exploring plotting variants
# using data from Q13402

# download the protein data first...

library(drawProteins)
library(ggplot2)
# pulling down data for Kevin's protein - MYO7A
prot_data <- drawProteins::get_features("Q13402")
# produce data frame
prot_data <- drawProteins::feature_to_dataframe(prot_data)

# make protein schematic
p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data)
p <- draw_domains(p, prot_data)
p





# nice domain structure
# note that it is a long protein > 2000 amino acids. 

p <- p + theme_bw(base_size = 14) + # white background
    theme(panel.grid.minor=element_blank(), 
        panel.grid.major=element_blank()) +
    theme(axis.ticks = element_blank(), 
        axis.text.y = element_blank()) +
    theme(panel.border = element_blank())
p




# EXPLORE VARIANTS
# many VARIANTS
# count them...
nrow(prot_data[prot_data$type == "VARIANT",])
# ans = 74

# these can be plotted using geom_point() function
p + geom_point(data = prot_data[prot_data$type == "VARIANT",],
    aes(x = begin,
        y = order+0.25))


# wanted focus on disease variants and colour the by disease...
# so need to change the info...
# colour or shape by info within it...

# duplicate prot_data
prot_data2 <- prot_data
# duplicate description..
prot_data2$disc2 <- prot_data$description
# remove word "in"
prot_data2$disc2 <- gsub("in ", "", prot_data2$disc2)

# pull out the disease name...
library(stringr)
temptable <- str_split_fixed(prot_data2$disc2, ";|:", 2)
prot_data2 <- cbind(prot_data2, temptable)
# name columns
tempnames <- colnames(prot_data2)
tempnames[11] <- "disease"
colnames(prot_data2) <- tempnames

# add colour argument within aesthetics
p + geom_point(data = prot_data2[prot_data2$type == "VARIANT",],
    aes(x = begin,
        y = order+0.25,
        colour = disease))



# exclude some of the VARIANTS as not disease causing..
# dbSNPs, rare polymorphism and the blank one 
prot_data2$disease[prot_data2$disease=="dbSNP"] <- NA
prot_data2$disease[prot_data2$disease==""] <- NA
prot_data2$disease[prot_data2$disease=="rare polymorphism"] <- NA

# clean up variant list for easier legend
disease_var_data <- prot_data2[prot_data2$type == "VARIANT",]
disease_var_data <- na.omit(disease_var_data)
disease_var_data$disease <- gsub("found patients with ", 
    "", disease_var_data$disease)
disease_var_data$disease <- gsub("found a patient with ", 
    "", disease_var_data$disease)

# add colour argument within aesthetics
p + geom_point(data = disease_var_data,
    aes(x = begin,
        y = order+0.25,
        colour = disease))

# over plotting is an problem 
# enlarge canvas and explore geom_jitter()
p <- p + ylim(0.75, 3)
p <- p + geom_jitter(data = disease_var_data,
            aes(x = begin,
                y = 2,
                colour = disease),
            width = 0, height = 0.75, 
            size = 2)


# add titles
myo7A_subtitle <- paste0("circles = Disease Variants\n",
    "source: Uniprot accessed 8 Feb 2018")

p <- p + labs(title = "Schematic of human Unconventional myosin-VIIa",
    subtitle = myo7A_subtitle)
p


SCRIPT END

Some Resources

For more help, bug reports or to suggest features