A site to help Biochemists learn R.

Wednesday, 29 July 2015

Analysing some citation data in R....

Last week, I showed a graph of the number of publications in my field of study, chronic lymphocytic leukemia: over 17,000 publications in total, with over 4,000 published between 2010 and 2014. This prompted me to ask which papers from 2010 to 2014 were the most cited in the field. What were the MUST READ papers?

Answering this required me to download citation data and do some analysis. Downloading citation data is not difficult, but it does take a while. You don't need to do it yourself to appreciate the graph that I made or the list of papers that came out of the analysis.

Here is the graph:

Here is the script to draw this graph:
# load ggplot2, which is needed for the ggplot() calls below
library(ggplot2)

# This data file has a list of the PubMed IDs, the year and the citation data
data <- read.csv("http://science2therapy.com/data/cllCitation2010to2014_20150722.csv", header=TRUE)

cit <- data$cit

# not very useful but good practice to plot the data first...
# (the log x-axis drops values <= 0, so R gives a warning here)
plot(density(cit), log='x')

# not very useful in ggplot either
p <- ggplot(data=data,          # specify the data frame with data
            aes(x=cit)) +       # specify the x variable for the graph
  geom_histogram(binwidth = 10)    # a histogram with bins 10 citations wide

p   # show the plot

# so most of the publications have relatively few citations.

# Do some subsetting to identify highly cited papers. 
# http://www.statmethods.net/management/subset.html

# calculate the mean number of citations 
mean.cit <- mean(data$cit)   # 4.1 for this data set. 

# data frame of publications with no citations
newdata.zero <- subset(data, cit == 0)

# data frame of publications with one citation
newdata.one <- subset(data, cit == 1)

# make a data frame of publications with more than one citation, up to the mean
newdata.greater1 <- subset(data, cit > 1)
newdata.mean <- subset(newdata.greater1, cit < mean.cit)

# make a data frame of publications with more than the mean citations
# up to the mean squared
newdata.greatermean <- subset(data, cit > mean.cit)
newdata.meanSq <- subset(newdata.greatermean, cit < (mean.cit^2))

# make a data frame of publications with more than the mean squared citations
# up to the mean cubed
newdata.greatermeanSq <- subset(data, cit > (mean.cit^2))
newdata.SqtoCube <- subset(newdata.greatermeanSq, cit < (mean.cit^3)) 

# make a data frame of publications with more than the mean cubed citations
newdata.greaterCube <- subset(data, cit > (mean.cit^3))

# assemble these numbers into a vector
count<-c(nrow(newdata.zero), nrow(newdata.one), nrow(newdata.mean), 
         nrow(newdata.meanSq), nrow(newdata.SqtoCube), nrow(newdata.greaterCube))

# create a vector of labels for the six groups
# (the labels approximate the cut points as powers of 4: 4, 16 and 64)
lab <- c("0","1","2-4","5-16","17-64", ">64")

# assemble a new data frame to plot with ggplot
df <- as.data.frame(count)
df$label <- lab
# fix the order of the factor levels so the bars are not sorted alphabetically
df$labfac <- factor(df$label, levels = df$label)

# do a nice bar chart of citation frequency in ggplot
p <- ggplot(data=df, aes(x=labfac, y=count)) + 
  geom_bar(stat="identity") +     # bar heights come straight from count
  xlab("Number of citations") +   # label x-axis
  ylab("Number of Papers") +    # label y-axis
  ggtitle("Chronic Lymphocytic Leukemia Papers published 2010 to 2014") +  # add a title
  theme_bw() +      # a simple theme
  expand_limits(y=c(0,2000)) +   # customise the y-axis
  theme(axis.title.y = element_text(size = 14 )) + 
  theme(axis.title.x = element_text(size = 14 )) + 
  theme(axis.text = element_text(size = 12))

p    #show us the plot
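
The six subsets above can also be built in one step. Here is a minimal sketch using base R's cut() and table(). It assumes fixed cut points at 0, 1, 4, 16 and 64 (matching the labels) rather than powers of the calculated mean, and the cit vector is made-up data, not the real citation file.

```r
# Hypothetical citation counts standing in for data$cit
cit <- c(0, 0, 1, 1, 1, 2, 3, 4, 5, 10, 16, 17, 40, 64, 65, 414)

# Assumed cut points: 0, 1, 4, 16 and 64
breaks <- c(-Inf, 0, 1, 4, 16, 64, Inf)
labels <- c("0", "1", "2-4", "5-16", "17-64", ">64")

# right = TRUE makes each bin (lower, upper], so a paper with 4 citations
# lands in "2-4" and a paper with 64 citations lands in "17-64"
bins <- cut(cit, breaks = breaks, labels = labels, right = TRUE)

# table() counts the papers in each bin, keeping the label order
df <- as.data.frame(table(bins))
names(df) <- c("labfac", "count")

print(df)
```

The resulting df has the same two columns (labfac and count) that the ggplot call uses, and labfac is already a factor with its levels in the right order.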


Thoughts on the citation analysis

The average number of citations per paper was just over 4. 
Only 24 papers were cited more than 64 times. 

Here is a list of the 5 most highly cited papers from 2010 to 2014:

  1. Porter et al. 2011, N Engl J Med: “Chimeric antigen receptor-modified T cells in chronic lymphoid leukemia.” Cited: 414 times
  2. Stephens et al. 2011, Cell: “Massive genomic rearrangement acquired in a single catastrophic event during cancer development.” Cited: 333 times
  3. Kalos et al. 2011, Sci Transl Med: “T cells with chimeric antigen receptors have potent antitumor effects and can establish memory in patients with advanced leukemia.” Cited: 287 times
  4. Puente et al. 2011, Nature: “Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia.” Cited: 228 times
  5. Grupp et al. 2013, N Engl J Med: “Chimeric antigen receptor-modified T cells for acute lymphoid leukemia.” Cited: 205 times

Full list of the 24 papers is available as a PDF here
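
A ranking like this can be pulled out of a citation data frame with order() and head(). This is only a sketch on a made-up data frame: the cit column name comes from the script, but pmid and year are assumed column names and all of the values are invented.

```r
# Hypothetical data frame; `pmid` and `year` are assumed column names
# and the numbers are invented for illustration
papers <- data.frame(
  pmid = c(101, 102, 103, 104, 105, 106),
  year = c(2010, 2011, 2011, 2012, 2013, 2014),
  cit  = c(3, 120, 0, 70, 95, 12)
)

# Sort the rows by citation count, highest first
ranked <- papers[order(papers$cit, decreasing = TRUE), ]

# Keep the three most cited papers
top3 <- head(ranked, 3)

# All papers cited more than 64 times
highly.cited <- subset(papers, cit > 64)

print(top3)
```

Changing the 3 in head() to 5 would give a top five, and nrow(highly.cited) gives the number of papers above the threshold.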
