R for Biochemists: August 2016

There is a vast amount of biochemical, biological and molecular data available on the internet. Lots of it is free and some of it is well curated. It's useful to be able to 'scrape' data from good quality databases and web pages.

One of my favourite sources of molecular information is the Uniprot database. By its own description, it is "a comprehensive, high-quality and freely accessible resource of protein sequence and functional information".

A little while ago, I showed how to use R to draw a schematic of TNFR1, the receptor for the pro-inflammatory cytokine TNF. This post shows how to scrape the list of interacting proteins from the TNFR1 Uniprot page and then draw a graph of the number of experiments supporting each interaction.

To scrape the webpage, I use the XML version of the page and the base R function readLines(). There are more sophisticated ways to download the page, including packages with functions that parse XML data, but I wanted to start simply.
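As a minimal sketch of the idea, the snippet below applies the same base R approach to a small, made-up XML fragment held in memory rather than to a downloaded page. The fragment, its values and the variable names are invented for illustration and do not reproduce the real Uniprot record; textConnection() simply lets readLines() read from a character string.

```r
# A made-up XML fragment (illustrative only; not the real Uniprot record)
xml_text <- '<entry>
<name>EXAMPLE_HUMAN</name>
<comment type="interaction">
<experiments>12</experiments>
</comment>
</entry>'

# readLines() returns one character string per line of input
con <- textConnection(xml_text)
sample_lines <- readLines(con)
close(con)

# grep() returns the line numbers that match a pattern
name_line <- grep("<name>", sample_lines)

# gsub() with a non-greedy pattern strips the tags, leaving only the text
sample_name <- gsub("<.*?>", "", sample_lines[name_line])
print(sample_name)
```

The same three steps (read lines, find the line numbers of interest, strip the tags) are what the script below does with the real page.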

Here is the graph:

Here is the script:

# START

library(ggplot2)

# extract interaction data from the TNFR1 Uniprot page

# this is the URL for the TNFR1 Uniprot page

url <- "http://www.uniprot.org/uniprot/P19438.xml"

# scrape a Uniprot page into R

# this readLines() function is a base R function which allows reading webpages

data <- readLines(url)

print(paste(url, "has just been scraped"))

# this function will clean up the data by removing the xml code

# https://stackoverflow.com/questions/17227294/removing-html-tags-from-a-string-in-r/17227415#17227415

cleanXml <- function(x) {
  return(gsub("<.*?>", "", x))
}

# identify where the protein names are in the scraped data

name.list <- grep("<name>", data)

name <- cleanXml(data[name.list[1]])

# identify where the interaction data lies in the scraped data

int.list <- grep("<comment type=\"interaction", data)

# 5 line numbers supplied in answer

# corresponds to five interactions on the Uniprot page...

# these are the details we want to extract:

features <- c("uniprot_id",    # 3 lines after returned number
              "uniprot_label", # 4 lines after returned number
              "uniprot_exp")   # 7 lines after returned number

# this function will extract the information from the scraped data

extInteractDataFromPage <- function(x) {
  uniprot_id <- cleanXml(data[x + 3])
  uniprot_label <- cleanXml(data[x + 4])
  uniprot_exp <- cleanXml(data[x + 7])
  three.val <- c(uniprot_id, uniprot_label, uniprot_exp)
  return(three.val)
}

# use lapply() to apply the function to each interaction

output <- lapply(int.list, extInteractDataFromPage)

# generates a list of 5 proteins that interact with TNFR1

# to graph this using ggplot, we need to turn the list into a data.frame

output.1 <- as.data.frame(t(as.data.frame(output)))

str(output.1)

# in R versions before 4.0 all the values are Factors (from 4.0 they are already characters), so turn them into characters...

output.1$V1 <- as.character(output.1$V1)

output.1$V2 <- as.character(output.1$V2)

# ...and into numbers.

output.1$V3 <- as.numeric(as.character(output.1$V3))

str(output.1)

# looks better with characters (chr) and numbers (num)

# let's draw the graph with ggplot

g <- ggplot(output.1,            # this is the dataframe
            aes(x = V2,          # names of the proteins
                y = V3)) +       # number of experiments showing the interaction
  geom_bar(stat = "identity") +  # make a bar chart
  theme_bw()

g # output the graph

# improve the labels

g <- g + xlab("") +
  ylab("Number of experiments") +
  theme(axis.title.x = element_text(size = 16)) +
  theme(axis.title.y = element_text(size = 16)) +
  theme(axis.text.x = element_text(size = 16)) +
  theme(axis.text.y = element_text(size = 16))

# add a graph title and a source for the data

g <- g + ggtitle(paste("Interacting proteins for", name,"\n Source:", url))

g + coord_flip() # turn the bar chart on its side.
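The step that turns the list into a data frame can also be written with do.call(rbind, ...) and stringsAsFactors = FALSE, which should avoid the factor conversion whatever the R version. This is only a sketch of an alternative, not the method used in the script above: the list below is an invented stand-in for the scraped output, and the IDs, labels, counts and variable names are placeholders rather than Uniprot data.

```r
# Invented stand-in for the list returned by lapply() in the script
example_output <- list(c("ID0001", "PROT_A", "3"),
                       c("ID0002", "PROT_B", "7"))

# Bind the character vectors into a matrix, one row per interaction
m <- do.call(rbind, example_output)

# Build the data frame with named columns; keep text as characters
# and convert the experiment counts to numbers
example_df <- data.frame(uniprot_id = m[, 1],
                         uniprot_label = m[, 2],
                         uniprot_exp = as.numeric(m[, 3]),
                         stringsAsFactors = FALSE)

str(example_df)
```

With named columns, the ggplot call could refer to uniprot_label and uniprot_exp instead of V2 and V3, which makes the plotting code easier to read.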

Monday 15 August 2016

A simple web scrape of protein interaction data from Uniprot...