Monday, 15 August 2016

A simple web scrape of protein interaction data from Uniprot...

There is a vast amount of biochemical, biological and molecular data available on the internet. Lots of it is free and some of it is well curated. It's useful to be able to 'scrape' data from good quality databases and web pages.

One of my favourite sources of molecular information is the Uniprot database. According to itself, it is "a comprehensive, high-quality and freely accessible resource of protein sequence and functional information".

A little while ago, I showed how to use R to draw a schematic of the receptor of the pro-inflammatory cytokine, TNFR1. This post shows how to scrape the list of interacting proteins from the Uniprot page in order to draw a graph of the number of experiments showing each interaction. 

For scraping the webpage, I use the xml version of the page and the base R function readLines(). There are more complex ways to download the page including packages and functions that will read in xml data. I wanted to start simply in the first instance...

Here is the graph: 





Here is the script:

# START
library(ggplot2)
# extract interaction data from the TNFR1 Uniprot page

# this is the URL for the TNFR1 Uniprot page
url <- c("http://www.uniprot.org/uniprot/P19438.xml")

# scrape a Uniprot page into R
# this readLines() function is a base R function which allows reading webpages
data <- readLines(url)
print(paste(url, "has just been scraped"))

# this readLines() function is a base R function which allows reading webpages
data <- readLines(url)
print(paste(url, "has just been scraped"))

# this function will clean up the data by removing the xml code
cleanXml <- function(x) {
  return(gsub("<.*?>", "", x))
}

# identify where the proteins names are on the scraped data
name.list <- grep("<name>", data)
name <- cleanXml(data[name.list[1]])

# identify where the interaction data lies on the scraped data
int.list <- grep("<comment type=\"interaction", data)
# 5 line numbers supplied in answer
# corresponds to five interactions on the Uniprot page...

# these are the details we want to extract:
features <- c("uniprot_id",    # 3 lines after returned number
              "uniprot_label", # 4 lines after returned number
              "uniprot_exp")   # 7 lines after returned number

# this function will extract the information from the scraped data
extInteractDataFromPage <- function(x){
  uniprot_id <- cleanXml(data[x + 3])
  uniprot_label <- cleanXml(data[x + 4])
  uniprot_exp <- cleanXml(data[x + 7])
  three.val <- c(uniprot_id, uniprot_label, uniprot_exp)
  return(three.val)
}

# use lapply() to apply the function to each interaction
output <- lapply(int.list, extInteractDataFromPage)
# generates a list of 5 proteins that interact with TNFR1

# to graph this using ggplot, we need to turn the list into a data.frame
output.1 <- as.data.frame(t(as.data.frame(output)))
str(output.1)
# but all the values are Factors so need to turn them into characters...
output.1$V1 <- as.character(output.1$V1)
output.1$V2 <- as.character(output.1$V2)
# ...and into numbers. 
output.1$V3 <- as.numeric(as.character(output.1$V3))
str(output.1) 
# looks better with characters (chr) and numbers (num)

# let's draw the graph with ggplot
g <- ggplot(output.1,         # this is the dataframe
            aes(x = V2,       # names of the proteins
                y = V3)) +    # number of experiments showing the interaction
  geom_bar(stat="identity") + # make a bar chart
  theme_bw()
g   # output the graph



# improve the labels
g <- g + xlab("") +
         ylab("Number of experiments") +  
         theme(axis.title.x = element_text(size = 16)) +
         theme(axis.title.y = element_text(size = 16)) +
         theme(axis.text.x = element_text(size = 16)) +
         theme(axis.text.y = element_text(size = 16))
g



# add a graph title and a source for the data
g <- g + ggtitle(paste("Interacting proteins for", name,"\n Source:", url))
g




g + coord_flip()  # turn the bar chart on it's side. 


Friday, 24 June 2016

Principal Component Analysis with published CLL gene expression data (Herishanu et al, Blood 2011)

Over the last few years, my understanding of chronic lymphocytic leukaemia has developed. Leukaemic cell proliferation has been shown to be a key element of the disease. The site of CLL cell proliferation was an important question. A gene expression analysis published in Blood by Herishanu et al in 2011 showed a proliferative gene expression pattern in the leukeamic cells from the lymph nodes as distinct from other sites of the body (peripheral blood or bone marrow).

I've been exploring the gene expression data using R. As a first step, I've explored the principal component analysis similar to that shown in Figure 1A of the paper.



The PCA is convincing because the three groups of samples cluster among themselves.  The 3D plot is useful because the three components separate the groups while two dimensions do not.


There are other ways of visualising PCA plots...  Sources are indicated as part of the script. 



Here is the script that I used to analyse this:

SCRIPT START
# install.packages("RCurl", "scatterplot3d")

library(RCurl)
library(scatterplot3d)

# A few year's ago a gene expression profile paper was published in Blood by Herishanu et al
# that gave us a new insight into chronic lymphocytic leukamia. 
# http://www.bloodjournal.org/content/117/2/563.long
# I have written this script to explore the data 
# and to reproduce the principal component analysis. 

# I have downloaded the original CEL files from GEO
# they were placed in a folder and the expression values extracted 
# what's produced is Affymetrix IDs and log2 expression values 

# I have created two data sets
# we can reproduce the PCA workflow with a subset of the data is required (5000 genes selected randomly). 
# this will be faster to download and visualise. 
# get the data - needs connection to the internet
# to use remove the hashtag "#"
# x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/herishanuMicroArrayDataSubset.tsv")

# here is a link for all the data (>50000 probes) 
x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/herishanuMicroArrayData.tsv")

# turn the data into a useful data.frame
data <- read.table(text = x, header = TRUE, sep = "\t")

# read in a file with names of samples
# created this from cut and paste of web page combined with manipulation in Word & Excel
sampID <- read.csv("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/Herishanu_Samp_ID.csv", header = FALSE)

str(sampID)
sampID$V1 <- as.character(sampID$V1)

# is order of column names same as list in sampID?
data.col.name <- colnames(data)
data.col.name <- gsub(".CEL", "", data.col.name)
sampID.name <- sampID$V1
all.equal(data.col.name, sampID.name)
# TRUE so that is useful. 


# change column names from chip to PB, BM and LN
# for peripheral blood, bone marrow and lymph node respectively
sampID$Name <- paste0(as.character(sampID$V3), as.character(sampID$V5))
colnames(data) <- sampID$Name

# how do the sample cluster?
# I first need to find the "distances" between the arrays
# and show those distances in the hierarchical cluster plot
# I can pile up all the commands in one
# for more of a step by step approach, see these scripts:

plot(hclust(dist(t(data))))



# this is interesting because all the patient samples cluter together!
# with bone marrow and peripheral blood closer together than LN in all cases. 
# this is what it says in the paper and is shown in the supplementary data. 

# What's needed is a way to normalise by patient - HOW??
# Quote from the paper:
# "In the 12 patients in whom all 3 compartments had been arrayed, 
# the patient effect on gene expression was subtracted 
# by mean centering the expression value of each gene across the 3 compartments 
# for each patient separately."

# the twelve with all 3 compartments: #1, #2, #3, #4, #8, #9, #10, #11, #12, #13, #25, #26

# so if I understand this correctly, I need to extract a patient sample
# then mean centre all three samples, 
# and repeat for all 12 samples

# let's explore this concept with just two samples to see if it works
# pick out data for patient #26
samp26 <- data[,grep("#26", sampID$Name)]

# pick out data for patient #25
samp25 <- data[,grep("#25", sampID$Name)]

# put them together
twosamp <- cbind(samp26, samp25)


# visualise with a box plot - look normalised
boxplot(twosamp)



plot(hclust(dist(t(twosamp))))


# cluster by patient

# I think I understand what they have done.
# mean centred for each gene.... i.e. by row in the data
# For us this is really each probeset
# http://gastonsanchez.com/how-to/2014/01/15/Center-data-in-R/

# for each of the values substract the rowMeans
samp25.s <- samp25 - rowMeans(samp25)
samp26.s <- samp26 - rowMeans(samp26)
twosamp.s <- cbind(samp26.s, samp25.s)

plot(hclust(dist(t(twosamp.s))))




# This seems to work and gives a different cluster diagram
# with LN coming together nicely.

# apply this to the whole data set.
# extract each patient cohort
# these twelve were: #1, #2, #3, #4, #8, #9, #10, #11, #12, #13, #25, #26
# do #1 and # 2 first then do the other automatically...

# pat # 1
colnames(data)
pat_1 <- cbind(data[,1], data[,27], data[,46])
colnames(pat_1) <- c("PB#1", "BM#1", "LN#1")
pat_1.s <- pat_1 - rowMeans(pat_1)
# pat # 2
pat_2 <- cbind(data[,2], data[,28], data[,47])
colnames(pat_2) <- c("PB#2", "BM#2", "LN#2")
pat_2.s <- pat_2 - rowMeans(pat_2)

data.n <- cbind(pat_1.s, pat_2.s)

data.reqd <- c("#3", "#4", "#8", "#9", "#10", "#11", "#12", "#13", "#25", "#26")
for(i in 1:length(data.reqd)){
  samp <- data[,grep(data.reqd[i], sampID$Name)]
  samp <- samp - rowMeans(samp)
  data.n <- cbind(data.n, samp)
}

# now have a file called dat.exp.n - normalised within patients.
colnames(data.n)
plot(hclust(dist(t(data.n))))



# nice cluster by region with good separation of LN samples
# some overlap within PB and LN


# extract cell site to use for colours in PCA plots
names <- colnames(data.n)
colourby <- gsub("#\\d+", "", colnames(data.n))
colourby <- gsub("\\.", "", colourby)
colourby <- gsub("\\d+", "", colourby)

# this does cluster the LN samples distinctly but there is still some overlap 
# of Peripheral Blood and Bone Marrow. 
# PCA from the paper looks nice and convincing
# http://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
# uses function princomp()

exp.pca <- princomp(data.n)  # this function does the PCA
print(exp.pca)
plot(exp.pca, type = "l")

Plot of variance by component for CLL gene expression patterns



plot(princomp(data.n)$loadings)
A simple 2D plot of component 1 vs component 2



p <- princomp(data.n)
loadings <- p$loadings[]
p.variance.explained <- p$sdev^2 / sum(p$sdev^2)

# plot percentage of variance explained for each principal component    
barplot(100*p.variance.explained, las=2, xlab='', ylab='% Variance Explained')


Looking at the effects of each component as a bar plot.



#*****************************************************************
# 2-D Plot
#******************************************************************         
x <- loadings[,1]
y <- loadings[,2]
z <- loadings[,3]
cols <- as.factor(colourby)
cols <- gsub("PB", "green", cols)
cols <- gsub("BM", "red", cols)
cols <- gsub("LN", "blue", cols)

# pch = 24: triangle point-up
# pch = 22: square
# pch = 21: circle
symbols <- as.factor(colourby)
symbols <- gsub("PB", 24, symbols)
symbols <- gsub("BM", 21, symbols)
symbols <- gsub("LN", 22, symbols)
symbols <- as.numeric(symbols)


# plot loadings on the first and second principal components 
# identify sample by body location
plot(x, y, type='p', 
     pch=symbols, 
     xlab='Comp.1', ylab='Comp.2', 
     col = cols, main = "Principal Component Analysis - CLL cell gene expression")

A simple 2D plot of component 1 vs component 2
 Each sample is coloured by location
# add a legend to the top of the plot
legend("top",      # location
       bty="n",              # suppress legend box, shrink text 50%
       title="Body Location of sample",
       c("PB", "BM", "LN"), 
       pch=symbols, col = cols, horiz=TRUE)
# label up the points
text(x, y, colnames(data.n), col=cols, cex=.5, pos=4)

# Comment: lymph node samples separate nicely from other samples
# some overlap between bone marrow and peripheral blood 

#*****************************************************************
# 3-D Plot, for good examples of 3D plots
#******************************************************************                 
# plot all companies loadings on the first, second, and third principal components and highlight points according to the sector they belong
s3d = scatterplot3d(x, y, z, 
                    xlab='Comp.1', ylab='Comp.2', zlab='Comp.3', 
                    color=cols, pch = symbols,
                    main = "Principal Component Analysis in 3D \n CLL cell gene expression")
A 3D graph separates the different groups of samples. 
s3d.coords = s3d$xyz.convert(x, y, z)
text(s3d.coords$x, s3d.coords$y, 
     labels=colnames(data.n), col=cols, cex=.8, pos=4)

Labels can be added but I'm not sure they help in this example. 

# change the order and the angle to make it look a bit more like the figure
s3d = scatterplot3d(y, x, z, 
                    xlab='PC2', ylab='PC1', zlab='PC3',             
                    color=cols, pch = symbols,
                    angle=-25,
                    grid = FALSE,
                    main = "Principal Component Analysis in 3D \n CLL cell gene expression")

par(xpd=TRUE)    # allows legend outside the graph

# add legend
legend(10, -4.5,     # location
       bty="n",              # suppress legend box
       title="Site of sample",
       c("PB", "BM", "LN"), 
       pch=symbols, col = cols, horiz=TRUE)

# add source as text
text(4, -6.5, cex =0.7,
     "Source: Herishanu et al, Blood 2011 117:563-574; doi:10.1182/blood-2010-05-284984")


SCRIPT END

Friday, 27 May 2016

Working with multiple Image Files....

I showcased this script, analysing multiple images, during our Image Analysis Workshop today. It was a very interesting morning. Dr Joaquín de Navascués supplied the Drosophila embryo images that we are analysing here.

In my previous two scripts, I have looked at analysing an image step by step. However, what we really want to do is to analyse multiple images and run the same analysis across all of them. One good way to do this in R is to bring all the images into one object - a list.

This allows us to write functions and then use the lapply( ) function to apply it to each element of the list.

The script below downloads 48 images from a folder on Github. These images are turned grey to allow masking. Then the regions of fluorescence are identified and the number of regions found with each staining is determined. We can look at the distribution of cells through the different layers.

The layers detail the staining of cells in a Drosophila embryo with different methods. The include fluorescence markers and nuclear stains.

Then I have extracted some data into a data frame and drawn graphs with ggplot2. Here are the graphs:


The graphs imply that the cell number as determined by the nuclear staining (blue graph) are similar in all the layers. They layers are generated as the fly embryo is imaged horizontally. The numbers of cells detected with Stain number 1 also stays constant while the number of cells detected with Stain number 2 diminish as we transit down the embryo.

Here is an example of the images:
Three of the images from the stack of 48. Left hand panel (green) is stain 1, middle panel is nuclei staining, right hand panel (white) is stain 2.


Here is the script. It's a bit long - 207 lines ! It might be easier to cut and paste from Github.
SCRIPT START
# install.packages("EBImage", "ggplot2", "RCurl")  # if required. 

library(EBImage)
library(ggplot2)
library(RCurl)

# this is a link to the list of files in our folider
link <- "https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/images/imageSeq/filenameslist"
# the download.file() function downloads and saves the file with the name given
download.file(url=link, destfile="filenameslist", mode="wb")
load("filenameslist")

# how many files are in this list?
length(filelist)
# 48

# this creates the first function called downloadImages
downloadImages <- function(x){
  # assemble URLs
  url <- paste0("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/images/imageSeq/", x)
  cat("Image downloaded") # so that I know something is happening. 
  img <- readImage(url)
}

# read in all the files into a list using the lapply( ) function
# lapply - applies a function to a list or a vector. 
# in this case, it applies the function readImage() to each element in the 
# object filelist
# lapply( ) syntax: name of list, name of function
images <- lapply(filelist, downloadImages)

# it takes a while to download 48 images - be patient

# we now have all 48 images in a list 
# to show us that R knows this is a list we can use the function mode( )
mode(images) # output to console says "list"

# to access a single image in the list we use double square brackets
display(images[[2]])  # display the second image in the list

# we can display the image brighter
display(images[[2]]*4, method = "raster")




# it might be good to save this locally so that it doesn't need 
# to be downloaded again
# CHECK where you want to save it first
# setwd("SomeWhereYouWantToSaveTheData")
# setwd("/Users/paulbrennan/Desktop")  # example
save(images, file = "Dros_Images")

# this can then be reloaded using the load( ) function
load("Dros_Images")

# Next step to alter the list of images a a group
# make all the images greyscale
# http://stackoverflow.com/questions/20347025/convert-rgb-image-to-one-channel-gray-image-in-r-ebimage
# create another function...
makeGrey <- function(img){
  print("One image grey")
  channel(img, mode = "grey")
}

# then apply the function to the list of images using lapply
# lapply( ) syntax: (name of list, name of function)
images.grey <-lapply(images, makeGrey)

# we have now created a list of grey scale images. 
display(images.grey[[2]]*4, method = "raster)  # allows to check. 




# write a function to create image masks
maskCells <- function(img1){   
  # blur the image
  img1 <- gblur(img1*2, sigma = 5)
  
  # apply a Otsu threshold 
  img1.mask <- img1 > otsu(img1)  
  
  # generate an image with different values for each connected region
  # key here is the bwlabel( ) function
  img1.mask <- bwlabel(img1.mask)
  
  # show this as a coloured blobs 
  display(colorLabels(img1.mask), method = "raster")
  
  
  # the bwlabel() function 'counts' the blobs
  count <- max(bwlabel(img1.mask))
  
  # this outputs the count to us
  cat('Number of cells in this image =', count,'\n')
  
  return(img1.mask)
}

# make the image masks using the function and the lapply
images.mask <- lapply(images.grey, maskCells)
# result is a list of image masks

# write function to extract number of objects masked
countRegions <- function(img1.mask){return(max(bwlabel(img1.mask)))}

# to extract numbers from the image masks use lapply and function again
numbers <- lapply(images.mask, countRegions)
# result is  a list of 48 numbers in this case

# use the numbers to make some plots
# in order to graph the numbers we put them into a data frame
numb <- unlist(numbers)
df <- data.frame(numb)
df$type <- c("memb", "st1", "nuc", "st2")
df$layer <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,
              10,10,10,10,11,11,11,11,12,12,12,12)

df.st1 <- subset(df, df$type == "st1")
df.st2 <- subset(df, df$type == "st2")
df.nuc <- subset(df, df$type == "nuc")

df.2 <- as.data.frame(cbind(df.nuc$layer, df.nuc$numb, df.st1$numb, df.st2$numb))
colnames(df.2) <- c("layer", "nuc", "st1", "st2")


# using the dataframe, I am making plots of the number of stained cells across the various layers
# make the plot for the numbers of nuclei
nucHist <- ggplot(df.2, 
                   aes(x=as.factor(layer), 
                       y=nuc)) +
            geom_bar(stat="identity", fill = "blue") +
            ggtitle("Nuclei per layer") +
            ylab("Number of regions") +
            xlab("Layers from top to bottom") +
            theme_bw()
# show the plot
nucHist



# make the plot for the numbers of regions stained with stain number 1
stain1Hist <- ggplot(df.2, 
                     aes(x=as.factor(layer), 
                         y=st1)) +
              geom_bar(stat="identity", fill = "green") +
              ggtitle("Stain number 1") +
              ylab("Number of regions") +
              xlab("Layers from top to bottom") +
              theme_bw()

# make the plot for the numbers of regions stained with stain number 2
stain2Hist <- ggplot(df.2, 
                     aes(x=as.factor(layer),
                         y=st2)) +
              geom_bar(stat="identity", fill = "grey") + 
              ggtitle("Stain number 2") +
              ylab("Number of regions") +
              xlab("Layers from top to bottom") + 
              theme_bw()



# from: http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}


# run code above to make multiplot function first
multiplot(stain1Hist, nucHist, stain2Hist, cols = 3)
SCRIPT END