R for Biochemists: June 2017

Wednesday, 28 June 2017

Using the Uniprot API to illustrate protein domains...

Update (2nd Nov 2017): I've updated the drawProteins package and it's totally using ggplot so this script doesn't work. For more details see this blog post, please.

I'm a member of the Cardiff R User group and we're using R to explore APIs - application programming interfaces which allow R to access data. I chose the explore the API that allows me to download data from Uniprot, "a comprehensive, high-quality and freely accessible resource of protein sequence and functional information".

Tomorrow, we have our show and tell about what we have learned about using APIs with R. So in preparation and inspired by Stef Locke, the lead organiser for the Cardiff R User group, I have now created my first R package. The R package, called drawProteins, uses the API output to create a schematic protein domains.

Here are two examples of the output. The first is a schematic of a single protein using Base R plotting (script below). The second example is a schematic of three proteins generated using ggplot2 (script in a little while).

Here is the script to plot a schematic of the transcription factor p65 (top panel):

SCRIPT START
# OK so I have to write some kind of a script to use the new
# Uniprot API
# This is R-UserGroup homework for Thursday....

# references: https://www.ebi.ac.uk/proteins/api/doc/#!/proteins/search
# references: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkx237

# needs the package httr
# install.packages("httr")
library(httr)

# my package with functions for extracting and graphing
# might require: install.packages("devtools")
devtools::install_github("brennanpincardiff/drawProteins")
library(drawProteins)

# so I want to use the Protein API
uniprot_acc <- c("Q04206") # change this for your fav protein

# Get UniProt entry by accession
acc_uniprot_url <- c("https://www.ebi.ac.uk/proteins/api/proteins?accession=")

comb_acc_api <- paste0(acc_uniprot_url, uniprot_acc)

# basic function is GET() which accesses the API
# requires internet access
protein <- GET(comb_acc_api,
accept_json())

status_code(protein) # returns a 200 means it worked

# use content() function from httr to give us a list
protein_json <- content(protein) # gives a Large list
# with 14 primary parts and lots of bits inside

# function from my package to extract names of protein
names <- drawProteins::extract_names(protein_json)

# I like a visualistion so I want to draw the features of this molecule
# https://www.ebi.ac.uk/proteins/api/features?offset=0&size=100&accession=Q04206

# Get features
feat_api_url <- c("https://www.ebi.ac.uk/proteins/api/features?offset=0&size=100&accession=")

comb_acc_api <- paste0(feat_api_url, uniprot_acc)

# basic function is GET() which accesses the API
prot_feat <- GET(comb_acc_api,
accept_json())

status_code(prot_feat) # returns a 200.

# so to process this into a data.frame
prot_feat %>%
content() %>%
flatten() %>%
drawProteins::extract_feat_acc() ->
features

# clean up for plotting
# focus on those that we want to plot...
features_plot <- features[features$length > 0, ]
features_plot <- features_plot[complete.cases(features_plot),]
features_plot <- features_plot[features_plot$type!= "CONFLICT",]

# add colours
library(randomcoloR) # might need install
colours <- distinctColorPalette(nrow(features_plot)-1)
features_plot$col <- c("white", colours)

# now draw this... with BaseR
# here is a function that does that....
drawProteins::draw_mol_horiz(names, features_plot)

# add phosphorylation sites
# get sites first.
features %>%
drawProteins::phospho_site_info() %>%
drawProteins::draw_phosphosites(5, "yellow") # 5 circle radius
text(250,0, "Yellow circles = phosphorylation sites", cex=0.8)

SCRIPT END

Resources

Stef's 3 lines of code to make an R package: https://itsalocke.com/the-making-of-datasaurus/
Another useful blog about writing a package:
http://tinyheero.github.io/jekyll/update/2015/07/26/making-your-first-R-package.html
Key reference: Hadley Wickham's book on Packages
Another R script exploring proteins: Drawing a simple phylogenetic tree of the human rel homology domain family

Thursday, 8 June 2017

Making a Namibia choropleth

According to Wikipedia, a choropleth map uses differences in shading, colouring, or the placing of symbols within predefined areas to indicate the average values of a particular quantity in those areas. I thought it would be interesting to make one of these in R for Namibia. I'm going to Namibia on Saturday as part of Project Phoenix, a partnership project between Cardiff University and University of Namibia.

I have used some data about urbanisation in Namibia which I downloaded from the Open Data for Africa website. This data shows some nice variation across the regions of Namibia.

Here is an example of the what can be done:

Here is the code and maps made along the way:

START
## map urbanisation data onto Namibia administrative regions.

## step 1: download the map data

## download the file...
library(sp)
link <- "http://biogeo.ucdavis.edu/data/gadm2.8/rds/NAM_adm1.rds"
download.file(url=link, destfile="file.rda", mode="wb")
gadmNab <- readRDS("file.rda")

## step 2: simplify the level of detail...
library(rmapshaper)
gadmNabS <- ms_simplify(gadmNab, keep = 0.01)
plot(gadmNabS)

## step 3: turn it into a ggplot object and plot
library(broom)
library(gpclib)
library(ggplot2)
namMapsimple <- tidy(gadmNabS, region="NAME_1")
ggplot(data = namMapsimple,
aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = "red"))

# check region names
unique(namMapsimple$id)
# the exclamation mark before Karas will cause a problem.
namMapsimple$id <- gsub("!", "", namMapsimple$id)
# check it worked
unique(namMapsimple$id)
# now no exclamation mark

## step 4: let's get our data and bind it...
# download the data about urbanisation
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/learnR/master/data/Namibia_region_urbanRural.csv")
data <- read.csv(text = x)
# exclude whole country
data_regions <- data[!data$region=="Namibia",]
# so extact 2011 data from the data_regions
data_regions2011 <- data_regions[data_regions$Date == 2011, ]
# just extract urban
data_regions2011_urban <- data_regions2011[data_regions2011$residence == "Urban",]
# change colnames so that we can merge by id...
data_col_names <- colnames(data_regions2011_urban)
data_col_names[1] <- c("id")
colnames(data_regions2011_urban) <- data_col_names

# bind the urbanisation data into map with merge
mapNabVals <- merge(namMapsimple,
data_regions2011_urban,
by.y = 'id')

p <- ggplot() + # sometimes don't add any aes...
geom_polygon(data = mapNabVals,
aes(x = long, y = lat,
group = group,
fill = Value))
p

# plot the outlines of the administrative regions
p <- p + geom_path(data=namMapsimple,
aes(x=long, y=lat, group=group))
p

p <- p + scale_fill_gradient("Urban\n pop\n %",
low = "white", high = "red",
limits=c(0, 100))
p

p <- p + coord_map() # plot with correct mercator projection
p

# add titles
p <- p + labs(x = "", # label x-axis
y = "", # label y-axis
title = "Urbanisation in Namibia 2011",
subtitle = "source: http://namibia.opendataforafrica.org")

p + theme_bw()

# maybe get rid of long, lat etc..

p <- p + theme_bw() +
theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank()) +
theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank()) +
theme(panel.border = element_blank())
p # show the object

## add names of regions
library(rgeos)
# calculate centroids
trueCentroids = gCentroid(gadmNabS,byid=TRUE)
# make a data frame
name_df <- data.frame(gadmNab@data[,'NAME_1'])
centroids_df <- data.frame(trueCentroids)
name_df <- cbind(name_df, centroids_df)
colnames(name_df) <- c("region", "long", "lat")

# label the regions
p <- p + geom_text(data = name_df,
aes(label = region,
x = long,
y = lat), size = 2)
p

# change different colours
p + scale_fill_gradient("Urban\n population\n %",
low = "white", high = "blue")

p + scale_fill_gradient("Urban\n pop\n %",
low = "white", high = "green")

# If you want to save this... export or use the ggave() function
p + ggsave("Namibia_Urbanisation_Map_2011.pdf")

END

Resources:

More maps on RforBiochemists:

I found this stackoverflow answer very useful
Data sources for Namibia

Plotting Country maps in R using GADM data

A protein assay	An enzyme kinetic plot	Visual Index	Study at Cardiff

R for Biochemists

A site to help Biochemists learn R.

How to use this site...

Starting points

Wednesday, 28 June 2017

Using the Uniprot API to illustrate protein domains...

Thursday, 8 June 2017

Making a Namibia choropleth