R for Biochemists: Analysing some citation data in R....

Last week, I showed a graph detailing the number of publications in my field of study, chronic lymphocytic leukemia: over 17,000 publications, with over 4,000 of them between 2010 and 2014. This inspired me to ask which papers published from 2010 to 2014 were the most cited in the field. What were the MUST READ papers?

Answering this required me to download citation data and do some analysis. Downloading citation data is not difficult but it does take up a bit of internet time. You don't need to do it yourself to appreciate the graph that I have made or the list of papers that it generated.

Here is the graph:

Here is the script to draw this graph:

START

# Load ggplot2, which is needed for the ggplot() calls later in the script

library(ggplot2)

# This data file has a list of the PubMed IDs, the year and the citation data

data <- read.csv("http://science2therapy.com/data/cllCitation2010to2014_20150722.csv", header = TRUE)

str(data)

cit <- data$cit

# not very useful but good practice to plot the data first...

plot(density(cit))

plot(density(cit), log='x')

hist(cit)

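# not very useful in ggplot either.

# (geom_histogram is used here because current versions of ggplot2
# no longer accept binwidth in geom_bar)

p <- ggplot(data = data,           # specify the data frame with data
            aes(x = cit)) +        # specify the x variable for the graph
  geom_histogram(binwidth = 10)    # a histogram with bins 10 citations wide

p # show the plot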

# so lots of the publications have relatively few citations.

# Do some subsetting to identify highly cited papers.

# http://www.statmethods.net/management/subset.html

# calculate the mean number of citations

mean.cit <- mean(data$cit) # 4.1 for this data set.

# data frame of publications with no citations

newdata.zero <- subset(data, cit == 0)

# data frame of publications with one citation

newdata.one <- subset(data, cit == 1)

# make a data frame of publications with more than one citation, up to the mean

newdata.greater1 <- subset(data, cit > 1)

newdata.mean <- subset(newdata.greater1, cit < mean.cit)

# make a data frame of publications with more than the mean citations

# up to the mean squared

newdata.greatermean <- subset(data, cit > mean.cit)

newdata.meanSq <- subset(newdata.greatermean, cit < (mean.cit^2))

# make a data frame of publications with more than the mean squared citations

# up to the mean cubed

newdata.greatermeanSq <- subset(data, cit > (mean.cit^2))

newdata.SqtoCube <- subset(newdata.greatermeanSq, cit < (mean.cit^3))

# make a data frame of publications with more than the mean cubed citations

newdata.greaterCube <- subset(data, cit > (mean.cit^3))

# assemble these numbers into a vector

count <- c(nrow(newdata.zero), nrow(newdata.one), nrow(newdata.mean),
           nrow(newdata.meanSq), nrow(newdata.SqtoCube), nrow(newdata.greaterCube))

# simple barplot

barplot(count)

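# create a vector of labels

# (approximate: the cutoffs are rounded to 4, 16 and 64; with a mean of 4.1
# the squared and cubed cutoffs are actually about 16.8 and 68.9)

lab <- c("0", "1", "2-4", "5-16", "17-64", ">64")

# assemble a new data frame to plot with ggplot

df <- as.data.frame(count)

df$label <- lab

# make the labels a factor so the bars keep this order rather than sorting alphabetically

df$labfac <- factor(df$label, levels = lab)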

# do a nice histogram of citation frequency in ggplot

p <- ggplot(data = df, aes(x = labfac, y = count)) +
  geom_bar(stat = "identity") +
  xlab("Number of citations") + # label x-axis
  ylab("Number of Papers") + # label y-axis
  ggtitle("Chronic Lymphocytic Leukemia Papers published 2010 to 2014") + # add a title
  theme_bw() + # a simple theme
  expand_limits(y = c(0, 2000)) + # customise the y-axis
  theme(axis.title.y = element_text(size = 14)) +
  theme(axis.title.x = element_text(size = 14)) +
  theme(axis.text = element_text(size = 12))

p # show us the plot

END
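
The six subsets in the script can also be produced in one step with base R's cut(), which assigns each value to an interval. What follows is only a minimal sketch of that alternative, not the script used to draw the graph. The cit vector holds made-up illustrative values, and mean.cit is set by hand to the 4.1 quoted in the script's comment. The labels are computed from the cutoffs, so with a mean of 4.1 the two upper bins come out as 17-68 and >68.

```r
# Illustrative citation counts (made-up values, not the real data set)
cit <- c(0, 0, 0, 1, 1, 2, 3, 4, 5, 9, 16, 17, 40, 70, 120)

# Mean quoted in the script's comment; with real data use mean(cit)
mean.cit <- 4.1

# Largest whole citation count that falls in each bin
m  <- floor(mean.cit)
m2 <- floor(mean.cit^2)
m3 <- floor(mean.cit^3)

# Build the bin labels from the cutoffs so they always match the breaks
lab <- c("0", "1",
         paste0("2-", m),
         paste0(m + 1, "-", m2),
         paste0(m2 + 1, "-", m3),
         paste0(">", m3))

# cut() uses intervals of the form (lower, upper], so 0 lands in (-Inf, 0]
bins <- cut(cit,
            breaks = c(-Inf, 0, 1, mean.cit, mean.cit^2, mean.cit^3, Inf),
            labels = lab)

# Count the papers in each bin and assemble a data frame for plotting
count <- table(bins)
df <- data.frame(label = factor(lab, levels = lab),
                 count = as.vector(count))

print(df)
```

One difference from the subset() version: a citation count exactly equal to a cutoff is placed in the lower bin here, whereas the strict < and > comparisons in the script would drop it. With whole-number counts and a non-integer mean this makes no difference.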

Thoughts on the citation analysis

The average number of citations per paper was just over 4.

Only 24 papers were cited more than 64 times.

Here is a list of the 5 most highly cited papers from 2010 to 2014:

Porter et al 2011 N Engl J Med “Chimeric antigen receptor-modified T cells in chronic lymphoid leukemia.” Cited: 414 times
Stephens et al 2011 Cell “Massive genomic rearrangement acquired in a single catastrophic event during cancer development.” Cited: 333 times
Kalos et al 2011 Sci Transl Med “T cells with chimeric antigen receptors have potent antitumor effects and can establish memory in patients with advanced leukemia.” Cited: 287 times
Puente et al 2011 Nature “Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia.” Cited: 228 times
Grupp et al 2013 N Engl J Med “Chimeric antigen receptor-modified T cells for acute lymphoid leukemia.” Cited: 205 times

The full list of the 24 papers is available as a PDF here.
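
A ranking like this can be produced by sorting the data frame by citation count. This is a minimal sketch under stated assumptions: only the cit column name comes from the script, while the pmid and year column names and all of the values are hypothetical placeholders.

```r
# Toy data frame; `pmid` and `year` are assumed column names, values are invented
papers <- data.frame(
  pmid = c(101, 102, 103, 104, 105, 106, 107),
  year = c(2010, 2011, 2011, 2012, 2013, 2013, 2014),
  cit  = c(3, 250, 0, 41, 120, 7, 1)
)

# Sort by citation count, highest first, and keep the top 5 rows
ranked <- papers[order(papers$cit, decreasing = TRUE), ]
top5 <- head(ranked, 5)

# Papers above a fixed threshold (64 here, as in the text)
highly.cited <- subset(papers, cit > 64)

print(top5)
print(nrow(highly.cited))
```

The PubMed IDs in the resulting rows could then be looked up to retrieve the authors, journal and title of each paper.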


Wednesday, 29 July 2015
