R for Biochemists: ggplot

Showing posts with label ggplot. Show all posts

Monday, 18 May 2020

Volcano time line by Tidy Tuesday...

We've been exploring the Tidy Tuesday datasets at our CaRdiff UseR group. It is good to look at a different types of data and the inspiration on Twitter. I tried to create a time line of volcano eruptions but it was a complete mess. Then I found this data visualisation by @ijeamaka_a which inspired me.

––

She very helpfully shared her code through Github so I was able to have a play with her code and use it to make some plots for other parts of the world and time frames.

Here is an example of a plot I have made which has eruptions in Indonesia between 1250 and 2020 with a volcano explosion index greater than 2.

Using this script it is possible to explore other places that have lots of earthquakes like Japan. This is a nice map of the volcanos in Japan.

START
## Inspiration from https://twitter.com/ijeamaka_a/status/1260660823229214724/photo/1
## https://github.com/Ijeamakaanyene/data_visualizations/blob/master/scripts/2020_10_volcanos.Rmd

library(dplyr)
library(tidyr)
library(ggplot2)

# download the data
volcano = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv')
eruptions = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/eruptions.csv')

# exploring the data a bit
## check countries with most eruptions...
erupt_by_country = left_join(eruptions, select(volcano, volcano_number, country, subregion),
by = c("volcano_number"="volcano_number")) %>%
group_by(country) %>%
count(sort = TRUE)
head(erupt_by_country)

## what size eruptions are most common...
count_erupt_size = left_join(eruptions, select(volcano, volcano_number, country, subregion),
by = c("volcano_number"="volcano_number")) %>%
group_by(vei) %>%
count(sort = TRUE)
## vei not available for lots of eruptions - historical I guess...
count_erupt_size

# create the bound data_set
all_eruptions_merged = left_join(eruptions, select(volcano, volcano_number,
country, subregion),
by = c("volcano_number"="volcano_number")) %>%
mutate(combo_year = as.numeric(paste(start_year, start_month, sep = ".")))

# allow some customization
## add your country, start year and end year
## and size of eruption from 1 to 7 or zero to see all.
my_country <- "Indonesia"
my_start_year <- 1250
my_end_year <- 2020
size <- 2

# filter merged data set...
eruptions_merged = all_eruptions_merged %>%
filter(country == my_country) %>%
filter(is.na(vei) == FALSE) %>%
filter(start_year > my_start_year) %>%
filter(start_year < my_end_year) %>%
filter(vei > size)

# Using for loop to create data needed to plot a geom_polygon
volcano_polygon_list = list()
years = unlist(eruptions_merged$combo_year)
volcano_ids = unlist(eruptions_merged$volcano_number)
veis = unlist(eruptions_merged$vei)

## There is most likely a better way of doing this.. but I cannot think of it!
## so for this to work on a wider than 20 year window we need to
## change the value for making the pyramid
## 0.25 is approx 1% of the number of decades
pyr_size <- (my_end_year - my_start_year)/100
for(i in 1:length(years)){
volcano_polygon_df = data.frame(
x = c(years[i], years[i] + pyr_size, years[i] + pyr_size*2),
y = c(0, veis[i], 0),
t = rep(volcano_ids[i], 3)
)
volcano_polygon_list[[i]] = volcano_polygon_df
}

# Converting into df and adding subregion information
volcano_polygon_df = volcano_polygon_list %>%
bind_rows() %>%
left_join(., select(eruptions_merged, volcano_number, subregion),
by = c("t" = "volcano_number"))

# create the plot
volcano_timeline = ggplot() +
geom_polygon(data = volcano_polygon_df, aes(x = x, y = y, group = t, fill = subregion),
alpha = 0.75, colour = "black") +
geom_segment(aes(y = 0, yend = 0, x = my_start_year, xend = my_end_year),
size = 1,
colour = "black",
arrow = arrow()) +
scale_x_continuous(limits = c(my_start_year, my_end_year),
expand = c(0.005, 0.005)) +
scale_y_continuous(expand = c(0, 0))


## add titles - based on the values put in...
volcano_timeline <- volcano_timeline +
labs(y = "Explosion Index", x = NULL, fill = NULL,
title = paste0("Volcanic Activity Timeline (vei>", size, ") ",
my_start_year, " to ", my_end_year, " within ", my_country),
subtitle = "Each triangle's height represents the erruptions volcanic explosion index.",
caption = paste0("Source: The Smithsonian Institution\n",
"Visualization: Paul Brennan | @brennanpcardiff adapted from Ijeamaka Anyene | @ijeamaka_a"))

## brown colour theme...
volcano_timeline + theme_bw() +
## separate out the colour scale to allow a change to this
scale_fill_manual(values = rcartocolor::carto_pal(n = 7, name = "BrwnYl"))

## Teal Green colour theme give the plot below...
volcano_timeline + theme_bw() +
## separate out the colour scale to allow a change to this
scale_fill_manual(values = rcartocolor::carto_pal(n = 7, name = "TealGrn"))

END

Resources

Tuesday, 13 February 2018

Plotting disease causing myosin variants using drawProteins

My drawProteins package has been accepted by Bioconductor. It's currently available as a development version - a nice conclusion of lots of effort.

Last week, I had my first query about drawProteins from Kevin T. Booth, a graduate student at the University of Iowa. Kevin wanted to plot disease causing protein variants.

Kevin's working on the MYO7A gene (Uniprot ID “Q13402”). Kevin told me that "alterations in MYO7A can result in 3 different phenotypes-Autosomal Dominant non-syndromic hearing loss (DFNA11), Autosomal Recessive non-syndromic hearing loss (DFNB2) or Usher Syndrome Type 1 (USH1B), another type of inherited deafness.

Looking at the Uniprot page for MYO7A gene, it contains a long list of disease associated variants. We created a visualisation of these variants using the geom_point() function to apply a ggplot2 layer to a schematic of the protein.

Here is the script that generated this and a few other images on the way. Feedback welcome.

SCRIPT START
# exploring plotting variants
# using data from Q13402

# download the protein data first...

library(drawProteins)
library(ggplot2)
# pulling down data for Kevin's protein - MYO7A
prot_data <- drawProteins::get_features("Q13402")
# produce data frame
prot_data <- drawProteins::feature_to_dataframe(prot_data)

# make protein schematic
p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data)
p <- draw_domains(p, prot_data)
p

# nice domain structure
# note that it is a long protein > 2000 amino acids.

p <- p + theme_bw(base_size = 14) + # white background
theme(panel.grid.minor=element_blank(),
panel.grid.major=element_blank()) +
theme(axis.ticks = element_blank(),
axis.text.y = element_blank()) +
theme(panel.border = element_blank())
p

# EXPLORE VARIANTS
# many VARIANTS
# count them...
nrow(prot_data[prot_data$type == "VARIANT",])
# ans = 74

# these can be plotted using geom_point() function
p + geom_point(data = prot_data[prot_data$type == "VARIANT",],
aes(x = begin,
y = order+0.25))

# wanted focus on disease variants and colour the by disease...
# so need to change the info...
# colour or shape by info within it...

# duplicate prot_data
prot_data2 <- prot_data
# duplicate description..
prot_data2$disc2 <- prot_data$description
# remove word "in"
prot_data2$disc2 <- gsub("in ", "", prot_data2$disc2)

# pull out the disease name...
library(stringr)
temptable <- str_split_fixed(prot_data2$disc2, ";|:", 2)
prot_data2 <- cbind(prot_data2, temptable)
# name columns
tempnames <- colnames(prot_data2)
tempnames[11] <- "disease"
colnames(prot_data2) <- tempnames

# add colour argument within aesthetics
p + geom_point(data = prot_data2[prot_data2$type == "VARIANT",],
aes(x = begin,
y = order+0.25,
colour = disease))

# exclude some of the VARIANTS as not disease causing..
# dbSNPs, rare polymorphism and the blank one
prot_data2$disease[prot_data2$disease=="dbSNP"] <- NA
prot_data2$disease[prot_data2$disease==""] <- NA
prot_data2$disease[prot_data2$disease=="rare polymorphism"] <- NA

# clean up variant list for easier legend
disease_var_data <- prot_data2[prot_data2$type == "VARIANT",]
disease_var_data <- na.omit(disease_var_data)
disease_var_data$disease <- gsub("found patients with ",
"", disease_var_data$disease)
disease_var_data$disease <- gsub("found a patient with ",
"", disease_var_data$disease)

# add colour argument within aesthetics
p + geom_point(data = disease_var_data,
aes(x = begin,
y = order+0.25,
colour = disease))

# over plotting is an problem
# enlarge canvas and explore geom_jitter()
p <- p + ylim(0.75, 3)
p <- p + geom_jitter(data = disease_var_data,
aes(x = begin,
y = 2,
colour = disease),
width = 0, height = 0.75,
size = 2)

# add titles
myo7A_subtitle <- paste0("circles = Disease Variants\n",
"source: Uniprot accessed 8 Feb 2018")

p <- p + labs(title = "Schematic of human Unconventional myosin-VIIa",
subtitle = myo7A_subtitle)
p

SCRIPT END

Some Resources

For more help, bug reports or to suggest features

drawProteins on Github
Bioconductor forum for questions
You can also email me like Kevin did - thanks Kevin.

Thursday, 2 November 2017

Using drawProteins for draw NF-kappaB proteins...

Update, 4 Jan 2018: Following advice from Bioconductor, the package magrittr is not integral to drawProteins so needs to be activated separately to use this script.

Update, 3 Nov 2017: I've renamed the functions e.g. geom_chains is now draw_chains and added a new function draw_canvas which allows one to draw without the chain, if you want.

Building on what I did previously, more work done on the drawProteins package allows the simplier script below to draw this protein schematic:

SCRIPT START
# devtools::install_github("brennanpincardiff/drawProteins")
library(drawProteins)
library(ggplot2)
library(magrittr)

# accession numbers of five NF-kappaB proteins
"Q04206 Q01201 Q04864 P19838 Q00653" %>%
drawProteins::get_features() %>%
drawProteins::feature_to_dataframe() ->
prot_data

p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data)
p <- draw_domains(p, prot_data)
p <- draw_repeat(p, prot_data)
p <- draw_motif(p, prot_data)
p <- draw_phospho(p, prot_data, size = 8)

# background and y-axis
p <- p + theme_bw(base_size = 20) + # white background and change text size
theme(panel.grid.minor=element_blank(),
panel.grid.major=element_blank()) +
theme(axis.ticks = element_blank(),
axis.text.y = element_blank()) +
theme(panel.border = element_blank())

# add titles
p <- p + labs(x = "Amino acid number", # label x-axis
y = "", # label y-axis
title = "Schematic of human NF-kappaB proteins",
subtitle = "circles = phosphorylation sites\nRHD = Rel Homology Domain\nsource:Uniprot")

# move legend to top
p <- p + theme(legend.position="top") + labs(fill="")

p
SCRIPT END

Resources

Github repository for the drawProteins package

Tuesday, 15 August 2017

Drawing NFkappaB protein schematics with ggplot2..

Update (2 Nov 2017): This still works but has been simplified with drawProteins package. See this blog page for the better example.

I've been busy with work and preparing materials for our R for Biochemists 101 online training course for the Biochemical Society. The materials are being tested now and the course should be available in October 2017. Then I had a short holiday...

So, it's taken a little longer than I hoped but here is the code to draw protein schematics with ggplot2. The schematic shows protein schematics of the five human NF-kappaB proteins. Here is the visualisation:

The code uses my drawProteins package which is on Github. This can be installed using the function install_github() from the package devtools. The required syntax is devtools::install_github("brennanpincardiff/drawProteins")

I've gone through the code step by step. It's quite long and I am planning to create some more functions that will shorten the code.

Here is the step by step code that uses drawProteins and ggplot2 to create a protein schematic visualisation. The ggplots along the way are shown too. To see the plots best requires a plot window of about 900 pixels.

# START
# working with rel proteins as I have been studying these for 10 year
# using ggplot2 to draw the plots.
# proteins are aligned to the N-termini
# alternative alignment options would be worth considering

library(httr)
# devtools::install_github("brennanpincardiff/drawProteins")
library(drawProteins)
library(ggplot2)

# Uniprot accession numbers of five human NF-kappaB family members
proteins_acc <- c("Q04206 Q01201 Q04864 P19838 Q00653")

# accession numbers need to be combined with "%2C" for Uniprot API
proteins_acc_url <- gsub(" ", "%2C", proteins_acc)

# this is the baseurl for a multiple features query
baseurl <- "https://www.ebi.ac.uk/proteins/api/features?offset=0&size=100&accession="

# make url to GET features for multiple accession numbers
url <- paste0(baseurl, proteins_acc_url)

# basic function is GET() which accesses the API
# accept_json() argument gives a JSON object
prots_feat <- GET(url,
accept_json())

status_code(prots_feat) # returns a 200 - good

# extract content() - gives a list with length = number of acc no
prots_feat_red <- httr::content(prots_feat)

################
# loop to work through the API object and convert to data.frame
# probably there is a better way to do this
features_total_plot <- NULL
for(i in 1:length(prots_feat_red)){
# the extract_feat_acc() function takes features into a data.frame
features_temp <- drawProteins::extract_feat_acc(prots_feat_red[[i]])
features_temp$order <- i # this order is needed for plotting later
features_total_plot <- rbind(features_total_plot, features_temp)
}

# look at the data.frame
View(features_total_plot) # 319 observations

### use ggplot2::geom_rect to draw moleucles
### in this script each type of feature is drawn separately.

## step 1: set up the canvas in ggplot (no data added)
# want to put text on the left in the area of negative numbers
plot_start <- -max(features_total_plot$end)*0.2

# need it to be longer than the longest molecule plus a bit...
plot_end <- max(features_total_plot$end) + max(features_total_plot$end)*0.1

p <- ggplot() +
ylim(0.5, max(features_total_plot$order)+0.5) +
xlim(plot_start, plot_end)
p # show the object

## step 2: draw the CHAINS in grey (narrower)
p <- p + geom_rect(data= features_total_plot[features_total_plot$type == "CHAIN",],
mapping=aes(xmin=begin,
xmax=end,
ymin=order-0.2,
ymax=order+0.2),
colour = "black",
fill = "grey")
p

## step 3: annotate with names
p <- p + annotate("text", x = -10, y = features_total_plot$order,
label = features_total_plot$entryName,
hjust = 1,
size = 4)
# check the object
p

## step 4: plot domains fill by description
p <- p + geom_rect(data= features_total_plot[features_total_plot$type == "DOMAIN",],
mapping=aes(xmin=begin,
xmax=end,
ymin=order-0.25,
ymax=order+0.25,
fill=description))
p

# label domains with descriptions
feat_domain <- features_total_plot[features_total_plot$type == "DOMAIN", ]
p <- p + geom_label(data = feat_domain,
aes(x = begin + (end-begin)/2,
y = order,
label = description))
p
# this works well with geom_label()

## step 5 plot motifs fill by description
p <- p + geom_rect(data= features_total_plot[features_total_plot$type == "MOTIF",],
mapping=aes(xmin=begin,
xmax=end,
ymin=order-0.25,
ymax=order+0.25,
fill=description))

p
# not going to label these as they're in the description and they are small

## step 6 plot repeats fill by description
p <- p + geom_rect(data= features_total_plot[features_total_plot$type == "REPEAT",],
mapping=aes(xmin=begin,
xmax=end,
ymin=order-0.25,
ymax=order+0.25),
fill = "lightblue")
p

# label repeats (for this they are ANK but remove digits)
p <- p + geom_text(data = features_total_plot[features_total_plot$type == "REPEAT",],
aes(x = begin + (end-begin)/2,
y = order,
label = gsub("\\d", "", description)),
size =2)
p

# add titles
p <- p + labs(x = "Amino acid number", # label x-axis
y = "", # label y-axis
title = "Schematic of five human NF-kappaB proteins",
subtitle = "circles = phosphorylation sites\nRHD: Rel Homology Domain\nsource:Uniprot")
p

# white background and remove y-axis
p <- p + theme_bw() + # white background
theme(panel.grid.minor=element_blank(),
panel.grid.major=element_blank()) +
theme(axis.ticks = element_blank(),
axis.text.y = element_blank()) +
theme(panel.border = element_blank())
p

# add phosphorlation sites
# extract info about sites using phospho_site_info() function
phospho_sites <- drawProteins::phospho_site_info(features_total_plot)

# draw phosphorylation sites on the protein with geom_point()
p <- p + geom_point(data = phospho_sites,
aes(x = begin,
y = order+0.25),
shape = 21,
colour = "black",
fill = "yellow", size = 4, stroke = 1)

# to see it properly you need to zoom...

# I have to admit that I am pleased with this diagram.
# The colours are better, the labels seem to work.

p + ggsave("five_human_nfkappab_proteins_20170814.pdf")
# END

The output looks best as a PDF, I think so that's why I've saved it.
There are lots of options for altering text size and the like.
I am still working on visualising protein schematics in R so if you have feedback, it would be most welcome.

Resources

Another R script exploring rel proteins: Drawing a simple phylogenetic tree of the human rel homology domain family
The Uniprot API http://www.ebi.ac.uk/proteins/api/doc/
Citation for the Uniprot API: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkx237
Documentation for geom_rect, geom_label and geom_text which I have used.

Thursday, 8 June 2017

Making a Namibia choropleth

According to Wikipedia, a choropleth map uses differences in shading, colouring, or the placing of symbols within predefined areas to indicate the average values of a particular quantity in those areas. I thought it would be interesting to make one of these in R for Namibia. I'm going to Namibia on Saturday as part of Project Phoenix, a partnership project between Cardiff University and University of Namibia.

I have used some data about urbanisation in Namibia which I downloaded from the Open Data for Africa website. This data shows some nice variation across the regions of Namibia.

Here is an example of the what can be done:

Here is the code and maps made along the way:

START
## map urbanisation data onto Namibia administrative regions.

## step 1: download the map data

## download the file...
library(sp)
link <- "http://biogeo.ucdavis.edu/data/gadm2.8/rds/NAM_adm1.rds"
download.file(url=link, destfile="file.rda", mode="wb")
gadmNab <- readRDS("file.rda")

## step 2: simplify the level of detail...
library(rmapshaper)
gadmNabS <- ms_simplify(gadmNab, keep = 0.01)
plot(gadmNabS)

## step 3: turn it into a ggplot object and plot
library(broom)
library(gpclib)
library(ggplot2)
namMapsimple <- tidy(gadmNabS, region="NAME_1")
ggplot(data = namMapsimple,
aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = "red"))

# check region names
unique(namMapsimple$id)
# the exclamation mark before Karas will cause a problem.
namMapsimple$id <- gsub("!", "", namMapsimple$id)
# check it worked
unique(namMapsimple$id)
# now no exclamation mark

## step 4: let's get our data and bind it...
# download the data about urbanisation
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/learnR/master/data/Namibia_region_urbanRural.csv")
data <- read.csv(text = x)
# exclude whole country
data_regions <- data[!data$region=="Namibia",]
# so extact 2011 data from the data_regions
data_regions2011 <- data_regions[data_regions$Date == 2011, ]
# just extract urban
data_regions2011_urban <- data_regions2011[data_regions2011$residence == "Urban",]
# change colnames so that we can merge by id...
data_col_names <- colnames(data_regions2011_urban)
data_col_names[1] <- c("id")
colnames(data_regions2011_urban) <- data_col_names

# bind the urbanisation data into map with merge
mapNabVals <- merge(namMapsimple,
data_regions2011_urban,
by.y = 'id')

p <- ggplot() + # sometimes don't add any aes...
geom_polygon(data = mapNabVals,
aes(x = long, y = lat,
group = group,
fill = Value))
p

# plot the outlines of the administrative regions
p <- p + geom_path(data=namMapsimple,
aes(x=long, y=lat, group=group))
p

p <- p + scale_fill_gradient("Urban\n pop\n %",
low = "white", high = "red",
limits=c(0, 100))
p

p <- p + coord_map() # plot with correct mercator projection
p

# add titles
p <- p + labs(x = "", # label x-axis
y = "", # label y-axis
title = "Urbanisation in Namibia 2011",
subtitle = "source: http://namibia.opendataforafrica.org")

p + theme_bw()

# maybe get rid of long, lat etc..

p <- p + theme_bw() +
theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank()) +
theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank()) +
theme(panel.border = element_blank())
p # show the object

## add names of regions
library(rgeos)
# calculate centroids
trueCentroids = gCentroid(gadmNabS,byid=TRUE)
# make a data frame
name_df <- data.frame(gadmNab@data[,'NAME_1'])
centroids_df <- data.frame(trueCentroids)
name_df <- cbind(name_df, centroids_df)
colnames(name_df) <- c("region", "long", "lat")

# label the regions
p <- p + geom_text(data = name_df,
aes(label = region,
x = long,
y = lat), size = 2)
p

# change different colours
p + scale_fill_gradient("Urban\n population\n %",
low = "white", high = "blue")

p + scale_fill_gradient("Urban\n pop\n %",
low = "white", high = "green")

# If you want to save this... export or use the ggave() function
p + ggsave("Namibia_Urbanisation_Map_2011.pdf")

END

Resources:

More maps on RforBiochemists:

I found this stackoverflow answer very useful
Data sources for Namibia

Plotting Country maps in R using GADM data

A protein assay	An enzyme kinetic plot	Visual Index	Study at Cardiff

R for Biochemists

A site to help Biochemists learn R.

How to use this site...

Starting points