R for Biochemists: Using the Uniprot API to illustrate protein domains...

Update (2nd Nov 2017): I've updated the drawProteins package and it's totally using ggplot so this script doesn't work. For more details see this blog post, please.

I'm a member of the Cardiff R User group and we're using R to explore APIs - application programming interfaces which allow R to access data. I chose the explore the API that allows me to download data from Uniprot, "a comprehensive, high-quality and freely accessible resource of protein sequence and functional information".

Tomorrow, we have our show and tell about what we have learned about using APIs with R. So in preparation and inspired by Stef Locke, the lead organiser for the Cardiff R User group, I have now created my first R package. The R package, called drawProteins, uses the API output to create a schematic protein domains.

Here are two examples of the output. The first is a schematic of a single protein using Base R plotting (script below). The second example is a schematic of three proteins generated using ggplot2 (script in a little while).

Here is the script to plot a schematic of the transcription factor p65 (top panel):

SCRIPT START
# OK so I have to write some kind of a script to use the new
# Uniprot API
# This is R-UserGroup homework for Thursday....

# references: https://www.ebi.ac.uk/proteins/api/doc/#!/proteins/search
# references: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkx237

# needs the package httr
# install.packages("httr")
library(httr)

# my package with functions for extracting and graphing
# might require: install.packages("devtools")
devtools::install_github("brennanpincardiff/drawProteins")
library(drawProteins)

# so I want to use the Protein API
uniprot_acc <- c("Q04206") # change this for your fav protein

# Get UniProt entry by accession
acc_uniprot_url <- c("https://www.ebi.ac.uk/proteins/api/proteins?accession=")

comb_acc_api <- paste0(acc_uniprot_url, uniprot_acc)

# basic function is GET() which accesses the API
# requires internet access
protein <- GET(comb_acc_api,
accept_json())

status_code(protein) # returns a 200 means it worked

# use content() function from httr to give us a list
protein_json <- content(protein) # gives a Large list
# with 14 primary parts and lots of bits inside

# function from my package to extract names of protein
names <- drawProteins::extract_names(protein_json)

# I like a visualistion so I want to draw the features of this molecule
# https://www.ebi.ac.uk/proteins/api/features?offset=0&size=100&accession=Q04206

# Get features
feat_api_url <- c("https://www.ebi.ac.uk/proteins/api/features?offset=0&size=100&accession=")

comb_acc_api <- paste0(feat_api_url, uniprot_acc)

# basic function is GET() which accesses the API
prot_feat <- GET(comb_acc_api,
accept_json())

status_code(prot_feat) # returns a 200.

# so to process this into a data.frame
prot_feat %>%
content() %>%
flatten() %>%
drawProteins::extract_feat_acc() ->
features

# clean up for plotting
# focus on those that we want to plot...
features_plot <- features[features$length > 0, ]
features_plot <- features_plot[complete.cases(features_plot),]
features_plot <- features_plot[features_plot$type!= "CONFLICT",]

# add colours
library(randomcoloR) # might need install
colours <- distinctColorPalette(nrow(features_plot)-1)
features_plot$col <- c("white", colours)

# now draw this... with BaseR
# here is a function that does that....
drawProteins::draw_mol_horiz(names, features_plot)

# add phosphorylation sites
# get sites first.
features %>%
drawProteins::phospho_site_info() %>%
drawProteins::draw_phosphosites(5, "yellow") # 5 circle radius
text(250,0, "Yellow circles = phosphorylation sites", cex=0.8)

SCRIPT END

Resources

Stef's 3 lines of code to make an R package: https://itsalocke.com/the-making-of-datasaurus/
Another useful blog about writing a package:
http://tinyheero.github.io/jekyll/update/2015/07/26/making-your-first-R-package.html
Key reference: Hadley Wickham's book on Packages
Another R script exploring proteins: Drawing a simple phylogenetic tree of the human rel homology domain family

A protein assay	An enzyme kinetic plot	Visual Index	Study at Cardiff

R for Biochemists

A site to help Biochemists learn R.

How to use this site...

Starting points

Wednesday, 28 June 2017

Using the Uniprot API to illustrate protein domains...

No comments:

Post a Comment