Last week, I submitted my drawProteins R package to Bioconductor. One of the recommended things to do as a Bioconductor developer is to ensure that the package can work with other Bioconductor packages.
I've created this blog post to show how to use biomaRt to pull out the UniProt accession numbers for the Gene Ontology (GO) term "MAP kinase activity", which has the GO identifier GO:0004707.
My script borrows heavily from the biomaRt users guide written by Steffen Durinck, Wolfgang Huber and Mike Smith. My thanks to them.
Using the R script below, I created this protein schematic:
# install biomaRt if you haven't used it before
# remove the hash tags...
# source("http://www.bioconductor.org/biocLite.R")
# biocLite()
# biocLite("biomaRt")
# (newer Bioconductor releases replace biocLite with
#  BiocManager::install("biomaRt"))
library(biomaRt)
# steps
# 1. choose a Mart with useMart()
# 2. choose a dataset with the dataset argument of useMart()
# 3. make a query with getBM()
#    with
#    A. filters - restrictions on the query
#       e.g. Interpro ID(s) [e.g. IPR000001]
#       e.g. chromosome_name
#    B. attributes - the values we want to retrieve
#    C. values - the values to pass to the filters
listMarts()
# from https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html
# choosing a database = MART and a dataset - gets more focussed each step...
ensembl = useMart("ensembl",
                  dataset="hsapiens_gene_ensembl")
# Retrieve all entrezgene identifiers and HUGO gene symbols of genes which have
# a “MAP kinase activity” GO term associated with them.
# this is GO:0004707
# (newer Ensembl releases call the first attribute 'entrezgene_id';
#  check listAttributes(ensembl) if this query fails)
getBM(attributes = c('entrezgene','hgnc_symbol'),
      filters = 'go',
      values = 'GO:0004707',
      mart = ensembl)
# this is 14 proteins....
# create output in a dataframe and add uniprotswissprot
# which is the UniProt ID
output <- getBM(attributes = c('uniprotswissprot',
                               'hgnc_symbol'),
                filters = 'go',
                values = 'GO:0004707',
                mart = ensembl)
# returns a dataframe... pull out uniprotIDs for drawing...
uniprotIDs <- output$uniprotswissprot
# get rid of blank entries - turn into NA
uniprotIDs[uniprotIDs==""] <- NA
# remove NA
uniprotIDs <- na.omit(uniprotIDs)
# make the IDs characters
uniprotIDs <- as.character(uniprotIDs)
# just the unique ones
uniprotIDs <- unique(uniprotIDs)
# combine into one element
uniprotIDs <- paste(uniprotIDs, collapse = " ")
# this can now be used in drawProteins
# now get features from Uniprot
library(magrittr)
# install drawProteins from Github
# devtools::install_github("brennanpincardiff/drawProteins")
library(drawProteins)
uniprotIDs %>%
  drawProteins::get_features() %>%
  drawProteins::feature_to_dataframe() ->
  prot_data
# data frame with 722 observations
library(ggplot2)
# basic drawing
p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data)
p <- draw_domains(p, prot_data)
p <- draw_repeat(p, prot_data)
p <- draw_motif(p, prot_data)
p <- draw_phospho(p, prot_data, size = 4)
# background and y-axis
p <- p + theme_bw(base_size = 20) + # white background and change text size
  theme(panel.grid.minor=element_blank(),
        panel.grid.major=element_blank()) +
  theme(axis.ticks = element_blank(),
        axis.text.y = element_blank()) +
  theme(panel.border = element_blank())
# add titles
p <- p + labs(title = "Schematic of human MAP kinases",
              subtitle = "circles = phosphorylation sites\nsource: UniProt")
# move legend to top
p <- p + theme(legend.position="top") + labs(fill="")
p
# AND it works!!
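As a rough sketch, the ID clean-up steps in the script above could be wrapped in one small helper, which makes it easier to repeat the exercise for another GO term. The function names clean_uniprot_ids and go_to_uniprot_ids are made up for illustration and are not part of biomaRt or drawProteins. The wrapper assumes the same 'go' filter and 'uniprotswissprot' attribute used above are available in the chosen Mart.

```r
# Hypothetical helper (not part of biomaRt or drawProteins):
# turn a column of UniProt accessions into the single space-separated
# string that the script above builds step by step.
clean_uniprot_ids <- function(ids) {
  ids <- as.character(ids)             # factors -> characters
  ids <- ids[!is.na(ids) & ids != ""]  # drop NA and blank entries
  ids <- unique(ids)                   # just the unique ones
  paste(ids, collapse = " ")           # combine into one element
}

# Hypothetical wrapper: the same getBM() query as above, for any GO ID.
# Needs biomaRt installed and an internet connection when it is called.
go_to_uniprot_ids <- function(go_id, mart) {
  output <- biomaRt::getBM(attributes = c("uniprotswissprot", "hgnc_symbol"),
                           filters = "go",
                           values = go_id,
                           mart = mart)
  clean_uniprot_ids(output$uniprotswissprot)
}

# offline check with a small hand-made vector of accessions
clean_uniprot_ids(c("P28482", "", NA, "P28482", "P27361"))
```

With the ensembl object from the script, go_to_uniprot_ids("GO:0004707", ensembl) should give the same string as uniprotIDs, ready to pass to drawProteins::get_features().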
Resources and citations:
- Using drawProteins to draw NF-kappaB proteins
- More detail here: Drawing NFkappaB protein schematics with ggplot2
- biomaRt home page on Bioconductor
- The biomaRt users guide
- Durinck, Steffen, Paul T. Spellman, Ewan Birney and Wolfgang Huber "Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt." Nature Protocols 4, 1184-1191 (2009).
- Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma and Wolfgang Huber "BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis." Bioinformatics 21, 3439-3440 (2005).
Comments and suggestions are welcome.