Last week, I had my first query about drawProteins from Kevin T. Booth, a graduate student at the University of Iowa. Kevin wanted to plot disease causing protein variants.
Kevin's working on the MYO7A gene (Uniprot ID “Q13402”). Kevin told me that "alterations in MYO7A can result in 3 different phenotypes-Autosomal Dominant non-syndromic hearing loss (DFNA11), Autosomal Recessive non-syndromic hearing loss (DFNB2) or Usher Syndrome Type 1 (USH1B), another type of inherited deafness.
Looking at the Uniprot page for MYO7A gene, it contains a long list of disease associated variants. We created a visualisation of these variants using the geom_point() function to apply a ggplot2 layer to a schematic of the protein.
Here is the script that generated this and a few other images on the way. Feedback welcome.
SCRIPT START
# exploring plotting variants
# using data from Q13402
# download the protein data first...
library(drawProteins)
library(ggplot2)
# pulling down data for Kevin's protein - MYO7A
prot_data <- drawProteins::get_features("Q13402")
# produce data frame
prot_data <- drawProteins::feature_to_dataframe(prot_data)
# make protein schematic
p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data)
p <- draw_domains(p, prot_data)
p
# nice domain structure
# note that it is a long protein > 2000 amino acids.
p <- p + theme_bw(base_size = 14) + # white background
theme(panel.grid.minor=element_blank(),
panel.grid.major=element_blank()) +
theme(axis.ticks = element_blank(),
axis.text.y = element_blank()) +
theme(panel.border = element_blank())
p
# EXPLORE VARIANTS
# many VARIANTS
# count them...
nrow(prot_data[prot_data$type == "VARIANT",])
# ans = 74
# these can be plotted using geom_point() function
p + geom_point(data = prot_data[prot_data$type == "VARIANT",],
aes(x = begin,
y = order+0.25))
# wanted focus on disease variants and colour the by disease...
# so need to change the info...
# colour or shape by info within it...
# duplicate prot_data
prot_data2 <- prot_data
# duplicate description..
prot_data2$disc2 <- prot_data$description
# remove word "in"
prot_data2$disc2 <- gsub("in ", "", prot_data2$disc2)
# pull out the disease name...
library(stringr)
temptable <- str_split_fixed(prot_data2$disc2, ";|:", 2)
prot_data2 <- cbind(prot_data2, temptable)
# name columns
tempnames <- colnames(prot_data2)
tempnames[11] <- "disease"
colnames(prot_data2) <- tempnames
# add colour argument within aesthetics
p + geom_point(data = prot_data2[prot_data2$type == "VARIANT",],
aes(x = begin,
y = order+0.25,
colour = disease))
# exclude some of the VARIANTS as not disease causing..
# dbSNPs, rare polymorphism and the blank one
prot_data2$disease[prot_data2$disease=="dbSNP"] <- NA
prot_data2$disease[prot_data2$disease==""] <- NA
prot_data2$disease[prot_data2$disease=="rare polymorphism"] <- NA
# clean up variant list for easier legend
disease_var_data <- prot_data2[prot_data2$type == "VARIANT",]
disease_var_data <- na.omit(disease_var_data)
disease_var_data$disease <- gsub("found patients with ",
"", disease_var_data$disease)
disease_var_data$disease <- gsub("found a patient with ",
"", disease_var_data$disease)
# add colour argument within aesthetics
p + geom_point(data = disease_var_data,
aes(x = begin,
y = order+0.25,
colour = disease))
# over plotting is an problem
# enlarge canvas and explore geom_jitter()
p <- p + ylim(0.75, 3)
p <- p + geom_jitter(data = disease_var_data,
aes(x = begin,
y = 2,
colour = disease),
width = 0, height = 0.75,
size = 2)
# add titles
myo7A_subtitle <- paste0("circles = Disease Variants\n",
"source: Uniprot accessed 8 Feb 2018")
p <- p + labs(title = "Schematic of human Unconventional myosin-VIIa",
subtitle = myo7A_subtitle)
p
SCRIPT END
Some Resources
For more help, bug reports or to suggest features
- drawProteins on Github
- Bioconductor forum for questions
- You can also email me like Kevin did - thanks Kevin.