R for Biochemists: How many publications in your area of research?

The amount of scientific literature is daunting. There are over 17,000 papers published in my research area, chronic lymphocytic leukemia. I wanted to graph this, so here is what I did:

I did a PubMed search using "chronic lymphocytic leukaemia" or "chronic lymphocytic leukemia" or "CLL". When the search has run, there is a nice graph on the left-hand side of the browser window - see the circle in the picture below. You can download a csv file of this data by clicking on "Download CSV".

If you import this CSV file into R, it looks a bit strange. The str(data) command gives you the following output:

> str(data)
'data.frame': 64 obs. of 1 variable:
$ pubmed...chronic.lymphocytic.leukaemia.or.chronic.lymphocytic.leukemia.or.CLL: Factor w/ 57 levels "1","10","1001",..: 57 51 3 56 53 52 55 50 48 47 ...

This is because there is a line at the top of the file, above the column headings, that R can't interpret properly. You need to open the file and delete the first line - a little bit of data wrangling. I did this in Excel.

I also did a search with (("chronic lymphocytic leukaemia" or "chronic lymphocytic leukemia" or "CLL")) AND REVIEW[Publication Type], which gave me 2,236 articles. I combined these two sets of data in Excel and saved the result as a csv file.
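The Excel steps could probably be done in R as well. Here is a minimal sketch, not what I actually did: it assumes each download has one title line followed by a header row and one row per year. The two temporary files and their contents are made-up stand-ins for the real downloads, and the column names are my own choice.

```r
# Hypothetical stand-ins for the two PubMed downloads:
# a title line, then a header row, then one row per year.
articles_file <- tempfile(fileext = ".csv")
reviews_file <- tempfile(fileext = ".csv")
writeLines(c("pubmed - example search",
             "year,count", "2015,725", "2014,1001", "2013,930"),
           articles_file)
writeLines(c("pubmed - example search AND review",
             "year,count", "2015,38", "2014,147"),
           reviews_file)

# skip = 1 ignores the title line so the header row is read properly;
# col.names replaces the header names with something more useful
articles <- read.csv(articles_file, skip = 1,
                     col.names = c("year", "articles"))
reviews <- read.csv(reviews_file, skip = 1,
                    col.names = c("year", "reviews"))

# all.x = TRUE keeps years that have articles but no reviews (reviews is NA)
data <- merge(articles, reviews, by = "year", all.x = TRUE)

# most recent year first, like the file made in Excel
data <- data[order(data$year, decreasing = TRUE), ]
str(data)
```

If the real file has more than one line above the header row, the skip value would need to change to match.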

When I import this new file, the str(data) command gives this:

> str(data)
'data.frame': 63 obs. of 3 variables:
$ year : int 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
$ articles: int 725 1001 930 787 776 819 695 618 608 594 ...
$ reviews : int 38 147 185 124 112 135 97 74 126 112 ...

This looks much more useful.

Here is the graph that I have made:

Here is the script for this:

# activate the required packages
library(ggplot2)
library(reshape2)
library(ggthemes)

# do search in pubmed on the web
# pubmed - "chronic lymphocytic leukaemia" or "chronic lymphocytic leukemia" or "CLL"
# get an option to download a csv file
# open the file and remove the top line to prevent an error
# add other data if you want and change the column titles to something useful
setwd("/Users/paulbrennan/Documents/Work Documents/Staff & Collaborations/RforBiochemists/RforBiochemists/data")
data <- read.csv("cllPublicationsformatted.csv")

# reshape the data from wide to long format
data.melt <- melt(data, id.vars = "year", value.name = "pubs", variable.name = "pub.type")

# draw the graph
p <- ggplot(data.melt, aes(x = year,
                           y = pubs,
                           colour = factor(pub.type, labels = c("Articles", "Reviews")))) +
  # colour = factor and the labels allows us to customize the legend
  geom_line(size = 1) +
  labs(color = "Publication Type") + # customizes the legend title
  scale_colour_manual(values = c("black", "red")) +
  ylab("Number Publications") + # y-label
  xlab("Year") + # x-label
  ggtitle("Chronic Lymphocytic Leukaemia Publications by Year") +
  scale_x_continuous(breaks = c(1960, 1970, 1980, 1990, 2000, 2010)) +
  theme_bw()

p <- p + theme(legend.position = c(0, 1), # move to the top left
               legend.justification = c(0, 1), # move it in a bit
               legend.text = element_text(size = 12), # increase size of text
               legend.title = element_text(size = 12)) # and the title

p <- p + theme(axis.title.y = element_text(size = 14)) +
  theme(axis.text = element_text(size = 12))

p # show the graph

N.B. You do get a warning:

Warning message:
In loop_apply(n, do.ply) :
  Removed 17 rows containing missing values (geom_path).

This is because there is more data for the articles than the reviews: 17 of the years have an article count but no review count, so those rows are dropped when the line is drawn. Nothing to worry about.
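If you want to see where the missing values come from, or get rid of the warning, you can count the missing values and drop those rows before plotting. This is a small sketch with made-up numbers. It builds the long format by hand in base R as a stand-in for melt(), so it runs without any packages.

```r
# Made-up example: the two earliest years have no review count
data <- data.frame(year = c(2015, 2014, 1960, 1959),
                   articles = c(725, 1001, 3, 1),
                   reviews = c(38, 147, NA, NA))

# how many years are missing a review count?
n_missing <- sum(is.na(data$reviews))

# long format: one row per year and publication type
# (the same shape that melt() gives in the script above)
data.long <- data.frame(
  year = rep(data$year, times = 2),
  pub.type = rep(c("articles", "reviews"), each = nrow(data)),
  pubs = c(data$articles, data$reviews)
)

# drop the rows with no count so there is nothing for ggplot to remove
data.long <- data.long[!is.na(data.long$pubs), ]
```

With the real data the same filter on data.melt should remove the 17 rows mentioned in the warning.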

This resource helped me: https://rpubs.com/hughes/10012


Tuesday, 21 July 2015
