R for Biochemists: ggplot2

Showing posts with label ggplot2. Show all posts

Thursday, 23 April 2020

Analysing an ELISA standard curve...

As part of R for Biochemists 101, the training programme created for the Biochemical Society, one of the participants, Abigail Byford (a British Heart Foundation 4-year PhD student from Leeds Institute of Cardiovascular and Metabolic Medicine) shared some of her data. She performed an ELISA for human chorionic gonadotropin (hCG) - a pregnancy hormone.

An ELISA generated colour that is read using a 96 well plate spectrophotometer that measures absorbance at a specific wavelength.

This script illustrates working with 96 well plate data, working with a standard curve and calculating unknowns.

--- START ---
library(readxl)
library(ggplot2)
library(dplyr)

# import data from Excel file no Github
# this comes straight from the spectrophotometer
link <- "https://github.com/brennanpincardiff/RforBiochemists/raw/master/data/hCG_absorbances.xlsx"

download.file(url=link, destfile="hCG_absorbances.xlsx", mode="wb")

hcg_file <- read_xlsx("hCG_absorbances.xlsx")

wavelength <- hcg_file[17,2]
days <- hcg_file[5,2]
# Excel stores the number of days since Jan-0-1900
# https://stackoverflow.com/questions/25158969/read-xlsx-reading-dates-wrong-if-non-date-in-column
# adding days to a date:
# https://stackoverflow.com/questions/10322035/r-adding-days-to-a-date
date <- as.Date("1900-01-01") + as.numeric(days) - 2
a

# this allows us a first look at the file...
# there is lots of metadata at the top which we don't need right now..

# we can skip at least the first 23 rows

abs_data <- read_xlsx("hCG_absorbances.xlsx", skip = 23)

# standard curve values are:
hcg <- c(5, 50, 200, 500, 1000, 5, 50, 200, 500, 1000)

# these correspond to A to E in columns 1 and 2
# we can pull these out with subsetting [rows, colums]
abs_data[1:5,2] # gives first set of standards

# using c() and unlist() will turn these into a vector
abs_std <- c(unlist(abs_data[1:5,2]), unlist(abs_data[1:5,3]))

# plot the data
plot(abs_std~hcg)

# create the dataframe
stand_curve <- data.frame(hcg, abs_std)

# draw the graph...
ggplot(stand_curve, aes(x = hcg, y = abs_std))+
geom_point() +
geom_smooth()
# line begins to top out at high absorbance - assay limitation

# plot and make the linear plot with just the first three values
# where hcg is less than 250
# here is the data...
stand_curve %>%
filter(hcg<250) %>%
ggplot(aes(x = hcg, y = abs_std)) +
geom_point() +
stat_smooth(method = "lm", formula = y~x) +
xlab("hCG") +
ylab(paste("Absorbance", as.character(wavelength))) +
ggtitle(paste("hCG Standard Curve \n", date))

stand_curve <- filter(stand_curve, hcg<250)
line <- lm(stand_curve$abs_std ~ stand_curve$hcg)

# extract intercept and slope from our line object
int <- line$coefficients[1]
slope <- line$coefficients[2]

# now calculate the unknowns
# R will calculate the whole data frame for us
# and put it into another data frame.
# abs_data[,2:11] - subsets the numbers we want.
# round() reduces the number of decimal points.
hCG_Ukns <- round((abs_data[,2:11] - int)/slope, 1)

hCG_Ukns
# some of the unknowns are below the lowest standard but hey...
--- END ---

Resources

Tuesday, 16 April 2019

An icon plot inspired by bacon and a new book....

Last Saturday, "The Art of Statistics - Learning from Data" by Professor Sir David Spiegelhalter arrived. I ordered it after watching this video entitled "Why statistics should make you suspicious". It is a very interesting book which discusses statistics, data visualisation, data science and the challenges of teaching statistics.

I like to see if I can reproduce figures from papers and books, so I spent most of yesterday trying to reproduce Figure 1.4 with R. I have been partially successful. I am pretty sure I could have made it more quickly with Powerpoint, but then I wouldn't have learned anything about R. Here is the image from the book (I hope it is OK to reproduce this... I will contact to ask permission):

It's an icon plot or an icon array. They can be used to communicate risk.

Here is my attempt to reproduce the icon array using the ggimage and ggwaffle packages:

It's not perfect but it's the best I can do at the moment. I'm not happy with the separation of the rows of icons. The final images are very large and cause R-Studio some problems in terms of speed of rendering. However, I've learned a lot.

A much quicker and easier method uses the personograph package. The types of icon are very restricted - only male icons :-( Also the random distribution of black icons of the original image is not possible. This is supposed to show the random nature of disease which I quite like. The images are very quick to render and the code is easy to understand.

Here are the images made with personograph:

Figure 1.4

Bacon sandwich example using a pair of icon arrays. Of 100 people who do not eat

bacon, 6 (black icons) develop bowel cancer in the normal run of events (top panel).

Of 100 people who eat bacon every day of their lives, there is 1 additional (red) case.

I've also worked with the waffle package as shown in the code below.

Here is all the R code:

## START
# installing the packages, first remove the hash tag to run:

# install.packages("personograph") # easiest way to start

# https://github.com/liamgilbey/ggwaffle
# devtools::install_github("liamgilbey/ggwaffle")

# install.packages("waffle", "readr", "ggpubr")

# https://github.com/GuangchuangYu/ggimage
# setRepositories(ind=1:2)
# install.packages("ggimage")

# first choice of package - nice and easy but limited customisation
library(personograph)
# https://github.com/joelkuiper/personograph
# https://cran.r-project.org/web/packages/personograph/index.html

# the data is supplied in a list and is plotted in order.
data <- list(first=0.06, second=0.94)
personograph(data, colors=list(first="black", second="#efefef"),
fig.title = "100 people who do not eat bacon",
draw.legend = FALSE, dimensions=c(5,20))

data_2 <- list(first=0.06, second=0.01, third=0.93)
personograph(data_2, colors=list(first="black", second="red", third="#efefef"),
fig.title = "100 people who eat bacon every day",
draw.legend = FALSE, dimensions=c(5,20))

# icons all male... no random distribution...

# second package: waffle
library(waffle)
dont_eat_bacon <- c('Cancer' = 6, 'No cancer' = 94)
waffle(dont_eat_bacon, rows = 5, colors = c("#000000", "#efefef"),
legend_pos = "bottom", title = "100 people who do not eat bacon")

eat_bacon <- c('Cancer' = 6, 'Extra case' = 1, 'No cancer' = 93)
waffle(eat_bacon, rows = 5, colors = c("#000000", "#f90000","#efefef"),
legend_pos = "bottom", title = "100 people who eat bacon every day")

# nice clear colours but difficult to change symbols
# without installing fonts into system...
# I would rather not have to install fonts...

# Try a third package - ggwaffle

library(readr)
library(ggwaffle)
library(ggpubr)
library(ggimage)
theme_set(theme_pubr())

# in this case, we basically encode every point as a graph.
# download the data
link <- ("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/ggwaffledata_mf.csv")
bacon_waf <- read_csv(link)

# have a look at data
View(bacon_waf)

# basic plot with geom_waffle() from ggwaffle package
ggplot(bacon_waf, aes(x, y, fill = bacon)) +
geom_waffle()

# add icons with geom_icon() from ggimage package
p1 <- ggplot(bacon_waf, aes(x, y, colour = no_bacon)) +
geom_icon(aes(image=icon), size = 0.1) +
scale_color_manual(values=c("black", "grey")) +
theme_waffle() +
theme(legend.position = "none") +
labs(x = "", y = "",
title = "100 people who do not eat bacon")

# show the plot
# this is VERY SLOW to draw...
# because it contains 100 icons each of which gets
# downloaded from the internet.
# I feel sure there is a better way but I don't know it at the moment...
p1

p2 <- ggplot(bacon_waf, aes(x, y, colour = bacon)) +
geom_icon(aes(image=icon), size = 0.1) +
scale_color_manual(values=c("black","red", "grey")) +
theme_waffle() +
theme(legend.position = "none") +
labs(x = "", y = "",
title = "100 people who eat bacon every day")

# this is the text at the bottom of the page
text <- paste("Figure 1.4\n",
"Bacon sandwich example using a pair of icon arrays, with randomly",
"scattered icons showing the incremental risk of eating bacon every",
"day. Of 100 people who do not eat bacon, 6 (solid icons) develop",
"bowel cancer in the normal run of events. Of 100 people who eat",
"bacon every day of their lives, there is 1 additional (red) case.",
sep = " ")

# format the text as ggplot object
# with ggparagraph() from the ggpubr package
text_p <- ggparagraph(text = text, size = 12, color = "black")

# arrange the images and text with ggarrange() from the ggpubr package
together <- ggarrange(p1, p2, text_p,
ncol = 1, nrow = 3,
heights = c(1, 1, 0.3))

together
# AGAIN VERY SLOW...!!

# ggsave("together_3.pdf", together)
# works but these are very large PDF file and
# take a lot of time to render

# at the end it seems good to clear memory...

gc()

## END

Some resources:

Reproducing immunization data from Factfulness - another good book from Hans Rosling.

Spiegelhalter, DJ (2008) "Understanding Uncertainty" Ann Fam Med 6:196-197 doi: 10.1370/afm.848

Galesic, M & Garcia-Retamero, R (2009) "Using Icon Arrays to Communicate Medical Risks: Overcoming Low Numeracy" Health Psychology 2009, Vol. 28, No. 2, 210 –216

"Infographic-style charts using the R waffle package" by N Saunders

Friday, 15 February 2019

Bar chart of common mental disorders...

My day job in the School of Medicine at Cardiff University involves facilitating learning around various medical conditions including mental health. I like a few statistics so I have been exploring the prevalence of mental health disorders. I found a report about mental health from Our World in Data which shares all the data it uses on Github - making it open source. There is lots of interesting data.
As well as mental health, there is data and reports about cancer and the burden of disease.

Inspired by the mental health report from Our World in Data, I downloaded some data and generated a graph which shows the prevalence of Mental Health Disorders in the UK.

Here is the graph:

Here is the R script that generated the graph and a few other graph along the way.

=== START ===
# looking at some mental health data...
# source: https://ourworldindata.org/mental-health

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

# download the data from Github
data <- read_csv("https://raw.githubusercontent.com/owid/owid-datasets/master/datasets/Mental%20health%20prevalence%20(IHME)/Mental%20health%20prevalence%20(IHME).csv")

# pull out data for UK and wrangle using pipes and dplyr
data %>%
# filter() by country and year
filter(Entity == "United Kingdom", Year == 2016) %>%
# select() prevalence - percentage 3rd to 13th column
select(3:13) %>%
# turn from wide format to long for better plotting using gather()
gather(key = "CMHD", value = "prevalence") -> data1

# now have new object data1

# first bar chart...
ggplot(data1, aes(x = CMHD, y = prevalence)) +
geom_bar(stat = "identity")

# plot horizontally with coord_flip()
ggplot(data1, aes(x = CMHD, y = prevalence)) +
geom_bar(stat = "identity") +
coord_flip()

# remove the text "- both sexes (percent)" gsub() function
data1$CMHD <- gsub(" \\- both sexes \$percent\$", "", data1$CMHD)
# the \\ are escape characters for minus and brackets

# AND

# reorder the categories as factors by size of prevalence
# https://www.reed.edu/data-at-reed/resources/R/reordering_geom_bar.html
data1$CMHD <- factor(data1$CMHD, levels = data1$CMHD[order(data1$prevalence)])

p <- ggplot(data1, aes(x = CMHD, y = prevalence)) +
geom_bar(stat = "identity") +
coord_flip()
p

# add some labels and source....
p <- p +
theme_bw() +
labs(x = "",
y = "Prevalence (%)",
title = "Prevalence of Common Mental Health Disorders in UK (2016)",
subtitle = "https://ourworldindata.org/mental-health")
p

# Our World in Data website has the numbers on the plot...
p <- p +
geom_text(aes(label=round(prevalence, 2)))
p

# Our World website has different coloured bars on the plot...
# by altering fill in the aes() of ggplot
p <- ggplot(data1, aes(x = CMHD, y = prevalence, fill = CMHD)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_bw() +
labs(x = "",
y = "Prevalence (%)",
title = "UK Prevalence of Common Mental Health Disorders (2016)",
subtitle = "https://ourworldindata.org/mental-health") +
geom_text(aes(label=round(prevalence, 1)))
p

# Which adds a legend... so remove the legend...
p + theme(legend.position="none")

=== END ===

Some resources:

Exploring diseases in Wales for SQL Relay...
A bar chart for Athena SWAN awards
ggplot documentation for geom_bar()
http://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization
Another source of data: https://www.mentalhealth.org.uk/statistics

Monday, 20 August 2018

Exploring more immunization data...

Last week, inspired by Factfulness, I made a graph showing the BCG immunization coverage for children at 1 year. The Factfulness graph didn't mention any specific immunization programme and World Health Organisation data monitors immunization coverage for other vaccines. For that reason, I thought it would be good to download and graph more of the available global immunization data.

This allowed me to generate this graph which shows immunization coverage across the world for nine different vaccines. The various dates that monitoring starts shows that new immunization programs are being rolled out on a regular basis - good to see.

Below is the code for downloading the data and making the graph...
If you would rather just make the graph with some cleaner data, the data from Aug 20, 2018 is available on github and can be downloaded using the read_csv() code shown about half way down the script.

## START
## download the data
library(tidyverse)
# install.packages("WHO")
library(WHO)

# check out the codes of the WHO data...
codes <- get_codes()
# get codes for immunizations
immun_codes <- codes[grepl("[Ii]mmuniz", codes$display), ]
immun_codes$label

# go through each of the 18 to find global data...
# Number 1 has global data

# download number 1
# requires internet access
immun_data <- as.tibble(get_data(immun_codes$label[1]))

# filter for global data
immun_data <- filter(immun_data, region == "(WHO) Global")

# repeat download for next 2 to 18 WHO codes
# had to do this as a loop as couldn't get it to work using lapply...
# start off with second value as first is above..
# requires internet access and patience...
for(i in 2:length(immun_codes$label)){
# download the data
data <- as.tibble(get_data(immun_codes$label[i]))

#tell you that it has downloaded...
print(paste("Dataset",immun_codes$label[i], "downloaded."))

# filter the data for Global values
data <- filter(data, region == "(WHO) Global")

# if there is some data bind_rows()
if(nrow(data)>1){
# bind_rows() function from dplyr
immun_data <- bind_rows(immun_data, data)
}else{
# if not just tell us....
print("No global data in this set")
}
}

# reduce columns using select() function
immun_data <- select(immun_data, gho, region, year, value )

# to avoid having to download every time... save a local copy
file_name <- paste0("global_immun_data", Sys.Date())
write_csv(immun_data, file_name)

## ----read_back if you have saved to continue from here
# immun_data <- read_csv(file_name)

# read in data from github using read_csv() function

# immun_data <- read_csv("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/global_immun_data2018-08-20")

# Let's make our plot...
plot <- ggplot(immun_data, aes(x = year, y = value,
colour = gho)) +
geom_line(size = 1)+
theme(legend.position="none")

plot

# separate the plots with facet wrap
plotf <- plot + facet_wrap(~gho)
plotf

## The individual graph titles are difficult to read
# Shorten them by removing text using gsub() = global substitution
immun_data$gho_s <- gsub("immunization coverage among 1-year-olds",
"", immun_data$gho)
immun_data$gho_s <- gsub("immunization coverage by the nationally recommended age",
"", immun_data$gho_s)

# make the plot again
plot <- ggplot(immun_data, aes(x = year, y = value,
colour = gho_s)) +
geom_line(size = 1) +
theme(legend.position="none")

# separate plots with facet_wrap
plotf <- plot + facet_wrap(~gho_s)
plotf <- plotf + theme_bw() + theme(legend.position="none")
plotf

# improve plot with y limits, titles & source
source <- paste("Source: World Health Organisation, accessed:", Sys.Date())
plotf <- plotf + ylim(0,100)
plotf <- plotf + labs(x = NULL, y = "Immunization Rate",
title = "Global immunization rates",
subtitle = source)
plotf

## END

Some Resources:

Exploring some immunization data... one graph from Factfulness..

Factfulness: Building Gapminder Income Mountains

Exploring UN data on researchers (a bubble plot)...

Scraping and visualising global vaccination recommendations with R...

Play with Gapminder data online

Bubble plots in ggplot2 from Jenny Bryan

Friday, 10 August 2018

Exploring some immunization data... one graph from Factfulness..

I've been reading Factfulness - an important book about statistics, data visualization and generally understanding the world. Written by Hans Rosling with Ola Rosling and Anna Rosling Rönnlund of Gapminder, it uses data to try to improve our world view as many of us have an incorrect world view. It is based on the very popular videos by Hans Rosling. Here is one to have a look at...

I like to explore data and there is lots of data in the book. One of the things I like to do is to try to reproduce the graph from publications. So here is a graph from Factfulness:

It shows how the world has improved in terms of immunization. Most of us don't know that almost 90% of children worldwide are immunized. Here is a graph for just the TB vaccine - data for BCG immunization uptake for children. It's similar to the graph above and it's made in R.

The code uses the WHO R package which allows us to access immunization data from the Global Health Observatory data repository.

Here is the code that makes the graph and some of the steps along the way.

START

library(tidyverse)
# install.packages("WHO")
library(WHO)
# check out the codes of the WHO data...
# requires internet access
codes <- get_codes()
colnames(codes)

glimpse(codes) # useful function from tibble package

codes[1:10, 2]

# get codes for immunizations
# search using grepl() function for finding regular expressions
codes[grepl("[Ii]mmuniz", codes$display), ]

# ----downloadBCG data
bcg_data <- get_data("WHS4_543") # BCG data...
# requires internet access and takes a few minutes...

## ----lookatBCG data
glimpse(bcg_data)

summary(bcg_data)

## ----globalGraph
# extract Global data with the filter() function from dplyr package
# then plot...
bcg_data %>%
filter(region == "(WHO) Global") %>%
ggplot(aes(x = year, y = value)) +
geom_line(size = 1) +
labs(x = NULL, y = "BCG Immunization Rates")

## To make the graph look more like Factfulness...
# 1. Add a title
# 2. Set limits to 0 and 100%
# 3. Add source
# 4. Add an point at the start
# 5. Add an arrow at the end
# 6. Add values and years to start and end...
# 7. To change the style to grey under the arrow

## add titles and limits to y axis...
bcg_data %>%
filter(region == "(WHO) Global") %>%
ggplot(aes(x = year, y = value)) +
geom_line(size = 1) +
labs(x = NULL, y = "BCG Immunization Rates",
title = "BCG Immunization Rates") +
ylim(0,100) -> bcg_plot
bcg_plot

# add a source
source <- paste("Source: World Health Organisation \n accessed:", Sys.Date())
bcg_plot + annotate("text", x = 2008, y = 10, label=source, size=3)

## Add points and annotation requires more work....
# pull out global data
bcg_data %>%
filter(region == "(WHO) Global") -> glob_bcg
# identify minimum and maximum point and labels
min_point <- filter(glob_bcg, year == min(glob_bcg$year))
min_point_label <- paste0(min_point$value, "%\n", min_point$year)
max_point <- filter(glob_bcg, year == max(glob_bcg$year))
max_point_label <- paste0(max_point$value, "%\n", max_point$year)

# Use WHO title - better plan
graph_title <- glob_bcg$gho[1]

# put it together to make a graph
bcg_plot <- ggplot(glob_bcg, aes(x = year, y = value)) +
geom_line(size = 1) +
labs(x = NULL, y = "Immunization Rate",
title = graph_title) +
ylim(0,100) +
geom_point(data = min_point, aes(year, value)) +
geom_text(data = min_point, label=min_point_label, hjust=-0.8) +
geom_point(data = max_point, aes(year, value),
shape = 62, size = 5) +
geom_text(data = max_point, label=max_point_label, vjust=1.2)

bcg_plot

## add grey underneath the line
# this uses geom_ribbon()
bcg_plot + geom_ribbon(aes(ymin=0, ymax=value),
fill = "#DCDCDC") # this colour is a nice grey.

# works but overlays earlier annotation.

## ----change order to make it all work together for final plot
bcg_plot <- ggplot(glob_bcg, aes(x = year, y = value)) +
geom_ribbon(aes(ymin=0, ymax=value), fill = "#DCDCDC") +
geom_line(size = 1) +
theme_bw() +
labs(x = NULL, y = "Immunization Rate",
title = graph_title) +
ylim(0,100) +
geom_point(data = min_point, aes(year, value), size = 5) +
geom_text(data = min_point, label=min_point_label, hjust=-0.8) +
geom_point(data = max_point, aes(year, value),
shape = 62, size = 8) +
geom_text(data = max_point, label=max_point_label, vjust=1.2) +
annotate("text", x = 2008, y = 10, label=source, size=3)

bcg_plot

END

Some Resources:

Exploring UN data on researchers (a bubble plot)...

Scraping and visualising global vaccination recommendations with R...

Play with Gapminder data online

Bubble plots in ggplot2 from Jenny Bryan

A protein assay	An enzyme kinetic plot	Visual Index	Study at Cardiff

R for Biochemists

A site to help Biochemists learn R.

How to use this site...

Starting points

Thursday, 23 April 2020

Analysing an ELISA standard curve...

Tuesday, 16 April 2019

An icon plot inspired by bacon and a new book....

Friday, 15 February 2019

Bar chart of common mental disorders...

Monday, 20 August 2018

Exploring more immunization data...

Some Resources:

Friday, 10 August 2018

Exploring some immunization data... one graph from Factfulness..

Some Resources:

Exploring UN data on researchers (a bubble plot)...

Scraping and visualising global vaccination recommendations with R...

Play with Gapminder data online

Bubble plots in ggplot2 from Jenny Bryan