A site to help Biochemists learn R.

Starting points

Thursday, 21 May 2015

Comparing two lists of proteins....

Last year, we published a paper analysing the CLL proteome. It was entitled: Proteomics-Based Strategies To Identify Proteins Relevant to Chronic Lymphocytic Leukemia (Alsagaby et al, J. Proteome Res., 2014, 13 (11), pp 5051–5062). As part of this, we produced a list of 728 proteins which is available from the JPR website as an Excel file.

This year a group from Liverpool published a very nice paper analysing the CLL proteome (Eagle et al (2015) Mol Cell Proteomics, 14, 933-945). They identified 3521 proteins! Their list is available here.

So, I want to compare the two lists and I think R is a good tool to use.

Here are the steps in my workflow:

  1. Download the data - I did this with the web browser and put the files in the same folder. 
  2. Import both files into R. I used the new Excel importer from Hadley Wickham. Because of the layout of my data, I had to clean my data frame to make the data look better (see first script, below). 
  3. The Liverpool files was imported into R much more easily. 
  4. Check out both files to see how we can compare the lists - sadly we used different protein Accession numbering systems. We used the UniProt system while the Liverpool group used the Swiss Prot system. Fortunately, we both added the full names of the proteins. 
  5. Use the intersect() function to find the common words - NO OVERLAP!
  6. Trouble shoot - this seemed extremely unlikely so I searched for an abundant protein "Actin, cytoplasmic 2" using the grep() function (see script 2 below). This showed that it was present in both but that the strings were different length due to a "trailing white space". 
  7. Clean up with spaces with a custom function and apply to listJPR:
    1. trim.trailing <- function (x) sub("\\s+$", "", x)
    2. listJPR<-trim.trailing(listJPR)
  8. Now the intersect() function reveals 625 proteins - sounds much more reasonable. 
  9. Prepare a Venn Diagram
Here is the Venn Diagram:

Here are the scripts for all this:

Import the protein list from the JPR manuscript and clean it up (cleanJPRproteinList.R):

# Import the Excel spreadsheets from published JPR paper
# This is the URL: 
# http://pubs.acs.org/doi/suppl/10.1021/pr5002803/suppl_file/pr5002803_si_003.xlsx
# only has one sheet

# Going to try new readxl package:
install.packages("readxl")  # only necessary 1st time

# would be good if I could get this directly from the webpage 
# but I need to use local file at the moment. 
data <- read_excel("pr5002803_si_003.xlsx")

# some need to cleaning of the files...
data2 <- data[4:731,2:6] #just the columns I need now
names <- c("Prot.no", "Id.prot","Ac.No", "Pep.Cnt", "Tot.Ion.Scr")
colnames(data2) <- names
rownames(data2) <- NULL #not necessary or useful

# this is now a list of 728 proteins with peptide count and total ion score
write.csv(data2, file = "jprProtList.csv")

Import the data, no overlap, use grep() to reveal a problem (twoProteinLists_grep.R):

dataMCP <- read_excel("mcp.M114.044479-2.xls")
dataJPR <- read.csv("jprProtList.csv")

listMCP <- dataMCP[,1]
listJPR <- as.character(dataJPR$Id.prot)

overlap <- intersect(listMCP, listJPR)
# Look for an abundant protein that should be present in both
# Actin - cytoplasmic 2
# find where it is and count the characters
i <- grep("Actin, cytoplasmic 2", listJPR)
# 21 characters...

i <- grep("Actin, cytoplasmic 2", listMCP)
# 20 characters...

# different number of characters!
# data from JPR paper has a space at the end.... arghh!

Import the data, eliminate trailing white space, measure overlap and draw Venn Diagram using Venn Diagram package (cleanProtListdrawVennDiag.R): 

dataMCP <- read_excel("mcp.M114.044479-2.xls")
dataJPR <- read.csv("jprProtList.csv")
# Id.prot has been imported as a factor

listMCP <- dataMCP[,1]
listJPR <- as.character(dataJPR$Id.prot) # convert to character

# get rid of trailing white space with this function
# from: http://stackoverflow.com/questions/2261079/how-to-trim-leading-and-trailing-whitespace-in-r#
trim.trailing <- function (x) sub("\\s+$", "", x)

# http://stackoverflow.com/questions/17598134/compare-two-lists-in-r
overlap <- intersect(listMCP, listJPR)  # easy/useful - 625 proteins
MCP.unique <- setdiff(listMCP, listJPR) # in 1st NOT 2nd
JPR.unique <- setdiff(listJPR, listMCP)
full.list <- unique(c(listMCP,listJPR))

#package VennDiagram
install.packages("VennDiagram") # only necessary 1st time

#draw the Venn diagram using the venn.plot() function
venn.plot <- draw.pairwise.venn(area1 = length(listMCP),#no MCP set
                           area2 = length(listJPR),#no JPR set
                           cross.area = length(overlap),
                           c("MCP Data", "JPR Data"), scaled = TRUE,
                           fill = c("green", "blue"),
                           cex = 1.5,
                           cat.cex = 1.5,
                           cat.pos = c(320, 25),
                           cat.dist = .05) 

No comments:

Post a Comment

Comments and suggestions are welcome.