A site to help Biochemists learn R.

Starting points

Tuesday, 21 July 2015

How many publications in your area of research?

The amount of scientific literature is daunting. There are over 17,000 papers published in my research area which is chronic lymphocytic leukemia. I wanted to graph this so here is what I did:

I did a PubMed search using "chronic lymphocytic leukaemia" or "chronic lymphocytic leukemia" or "CLL". When the search is made, there is a nice graph on the left hand side of the browser window - see the circle in the picture below. You can download a csv file of this data by pressing on "Download CSV".



If you import this CSV file into R, it looks a bit strange. The str(data) command gives you the following output:

> str(data)
'data.frame': 64 obs. of  1 variable:
 $ pubmed...chronic.lymphocytic.leukaemia.or.chronic.lymphocytic.leukemia.or.CLL: Factor w/ 57 levels "1","10","1001",..: 57 51 3 56 53 52 55 50 48 47 ...

This is because it has a line at the top of the file that R can't interpret properly. You need to open the file and delete the first line - a little bit of data wrangling. I did this in Excel.
I also did a search with (("chronic lymphocytic leukaemia" or "chronic lymphocytic leukemia" or "CLL")) AND REVIEW[Publication Type] which gave me 2,236 articles. I combined these two bits of data in Excel. I saved this as a csv file.

When I import this new file, the str(data) command gives this:

> str(data)
'data.frame': 63 obs. of  3 variables:
 $ year    : int  2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
 $ articles: int  725 1001 930 787 776 819 695 618 608 594 ...
 $ reviews : int  38 147 185 124 112 135 97 74 126 112 ...

This looks much more useful.

Here is the graph that I have made:



Here is the script for this:


# activate the required packages
library(ggplot2)
library(reshape2)
library(ggthemes)

# do search in pubmed on the web
# pubmed - "chronic lymphocytic leukaemia" or "chronic lymphocytic leukemia" or "CLL"
# get an option to download a csv file
# open the file and remove the top line to prevent an error 
# add other data if you want and change the column titles to something useful
setwd("/Users/paulbrennan/Documents/Work Documents/Staff & Collaborations/RforBiochemists/RforBiochemists/data")
data <- read.csv("cllPublicationsformatted.csv")

# reshape the data from wide to long format
data.melt <- melt(data, id.vars = "year", value.name = "pubs", variable.name = "pub.type")

# draw the graph
p <- ggplot(data.melt, aes(x=year, 
                           y= pubs, 
                           colour = factor(pub.type, labels = c("Articles", "Reviews")))) + 
# colour = factor and the labels allows us to customize the legend
    geom_line(size=1) +
    labs(color = "Publication Type") + # customizes the legend title
    scale_colour_manual(values=c("black","red")) +
    ylab("Number Publications") + # y-label
    xlab("Year") + # x-label
    ggtitle("Chronic Lymphocytic Leukaemia Publications by Year") +
    scale_x_continuous(breaks=c(1960,1970,1980,1990, 2000, 2010)) +
    theme_bw()

p <- p + theme(legend.position=c(0,1), # move to the top left
               legend.justification=c(0,1), # move it in a bit
               legend.text=element_text(size = 12), # increase size of text
               legend.title=element_text(size = 12)) # and the title
               
p <- p + theme(axis.title.y = element_text(size = 14 )) + 
         theme(axis.text = element_text(size = 12))

p # show the graph



N.B. You do get a warning:
Warning message:
In loop_apply(n, do.ply) :
  Removed 17 rows containing missing values (geom_path).

This is because there is more data for the articles than the reviews. Nothing to worry about. 

This resource helped me: https://rpubs.com/hughes/10012

No comments:

Post a Comment

Comments and suggestions are welcome.