One of my favourite sources of molecular information is the Uniprot database. In its own words, it is "a comprehensive, high-quality and freely accessible resource of protein sequence and functional information".
A little while ago, I showed how to use R to draw a schematic of TNFR1, the receptor for the pro-inflammatory cytokine TNF. This post shows how to scrape the list of interacting proteins from the Uniprot page for TNFR1 in order to draw a graph of the number of experiments showing each interaction.
For scraping the webpage, I use the XML version of the page and the base R function readLines(). There are more sophisticated ways to download the page, including packages with functions that read XML data directly, but I wanted to start simply in the first instance; a sketch of one such alternative is shown below.
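For anyone who prefers a proper XML parser, here is a minimal sketch using the xml2 package (this assumes xml2 is installed and that the page keeps its current structure; the script below does not rely on it):

library(xml2)
doc <- read_xml("http://www.uniprot.org/uniprot/P19438.xml")
# Uniprot XML sits in a default namespace, which xml2 exposes under the prefix d1
ints <- xml_find_all(doc, ".//d1:comment[@type='interaction']", xml_ns(doc))
length(ints) # number of interaction comments found on the page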
Here is the graph:
Here is the script:
# START
library(ggplot2)
# extract interaction data from the TNFR1 Uniprot page
# this is the URL for the TNFR1 Uniprot page
url <- "http://www.uniprot.org/uniprot/P19438.xml"
# scrape a Uniprot page into R
# readLines() is a base R function that reads a webpage into a character vector, one element per line
data <- readLines(url)
print(paste(url, "has just been scraped"))
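# a quick sanity check: length(data) is the number of lines read from the page
length(data)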
# this function will clean up the data by removing the xml code
# https://stackoverflow.com/questions/17227294/removing-html-tags-from-a-string-in-r/17227415#17227415
cleanXml <- function(x) {
return(gsub("<.*?>", "", x))
}
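# for example, cleanXml("<name>TNFR1</name>") returns "TNFR1"
# (the regex strips anything between angle brackets and keeps the text)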
# identify where the protein names are in the scraped data (grep() returns the line numbers)
name.list <- grep("<name>", data)
name <- cleanXml(data[name.list[1]])
# identify where the interaction data lies in the scraped data
int.list <- grep("<comment type=\"interaction", data)
# grep() returns 5 line numbers here,
# corresponding to the five interactions listed on the Uniprot page...
# these are the details we want to extract:
features <- c("uniprot_id", # 3 lines after returned number
"uniprot_label", # 4 lines after returned number
"uniprot_exp") # 7 lines after returned number
# this function will extract the information from the scraped data
extInteractDataFromPage <- function(x){
uniprot_id <- cleanXml(data[x + 3])
uniprot_label <- cleanXml(data[x + 4])
uniprot_exp <- cleanXml(data[x + 7])
three.val <- c(uniprot_id, uniprot_label, uniprot_exp)
return(three.val)
}
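# quick check of the function on the first interaction before applying it to all of them
# (the exact values returned depend on the live Uniprot page)
extInteractDataFromPage(int.list[1])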
# use lapply() to apply the function to each interaction
output <- lapply(int.list, extInteractDataFromPage)
# generates a list of 5 proteins that interact with TNFR1
# to graph this using ggplot, we need to turn the list into a data.frame
output.1 <- as.data.frame(t(as.data.frame(output)))
str(output.1)
# but all the values are factors, so they need to be turned into characters...
output.1$V1 <- as.character(output.1$V1)
output.1$V2 <- as.character(output.1$V2)
# ...and into numbers.
output.1$V3 <- as.numeric(as.character(output.1$V3))
str(output.1)
# looks better with characters (chr) and numbers (num)
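# note: from R 4.0 onwards, as.data.frame() no longer converts strings to
# factors by default, so the conversions above may not be needed
# an alternative sketch that builds the data.frame in one go (equivalent columns):
# output.2 <- as.data.frame(do.call(rbind, output), stringsAsFactors = FALSE)
# output.2$V3 <- as.numeric(output.2$V3)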
# let's draw the graph with ggplot
g <- ggplot(output.1, # this is the dataframe
aes(x = V2, # names of the proteins
y = V3)) + # number of experiments showing the interaction
geom_bar(stat="identity") + # make a bar chart
theme_bw()
g # output the graph
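# note: in recent versions of ggplot2, geom_col() is a shorthand for
# geom_bar(stat = "identity") and would draw the same chart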
# improve the labels
g <- g + xlab("") +
ylab("Number of experiments") +
theme(axis.title.x = element_text(size = 16)) +
theme(axis.title.y = element_text(size = 16)) +
theme(axis.text.x = element_text(size = 16)) +
theme(axis.text.y = element_text(size = 16))
g
# add a graph title and a source for the data
g <- g + ggtitle(paste("Interacting proteins for", name, "\n Source:", url))
g
g + coord_flip() # turn the bar chart on its side.
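# to keep a copy of the final figure, something like this works
# (the filename and dimensions are just placeholders):
# ggsave("TNFR1_interactions.png", width = 8, height = 5)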