R for Biochemists: October 2015

Monday, 26 October 2015

Exploring UN data on researchers...

Last Tuesday was World Statistics Day 2015 as declared by the United Nations. I took the opportunity to play with some of the data about the number of researchers around the globe and made a visualisation about them. The visualisation looks at the possible relationship between the number of researchers per 1000 population and the amount of money spent as a proportion of GDP. The style is inspired by Gapminder and the page by Professor Jennifer Bryan where the countries are sized by population and coloured by continent.

Here is the visualisation:

To make this visualisation, I had to download data from the UN website about the number of researchers. The screen grab is here:

and here:

My understanding of the Terms and Conditions suggests that I'm allowed to put the data I downloaded onto Github which I have done to allow it to be downloaded within R. The filename has the date and time of download so that I know when I did it.

Here is the script I wrote while playing with the data and some graphs I made along the way.

The script illustrates one interesting feature of ggplot which is the ability to force a different data frame into an existing plotting object using the %+% command. I heard about %+% first from Steph Locke who runs the Cardiff R User Group and then from this Stack Overload question.

Here is the script:

SCRIPT

# install these if necessary

library(dplyr)

library(ggplot2)

library(RCurl)

# who has the most researchers in the world?

# download the data

x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/numberOfResearchersFTEUNdata_Export_20151019_214919785.csv")

data <- read.csv(text = x)

# extract the data for a particular year - this is a decision - 2011 in this case

# lose some countries as they don't have 2011 data e.g. Australia and India

# option exists to change the year and run again...

year <- 2011

data <- subset(data, data$Time.Period == year)

# rename United Kingdom, US, Rep of Korea and Venezuela

data$Reference.Area <- gsub("United Kingdom of Great Britain and Northern Ireland", "United Kingdom", data$Reference.Area)

data$Reference.Area <- gsub("United States of America", "United States", data$Reference.Area)

data$Reference.Area <- gsub("Republic of Korea", "Korea, Rep.", data$Reference.Area)

data$Reference.Area<- gsub("Venezuela (Bolivarian Republic of)", "Venezuela", data$Reference.Area)

# sort the data

data.sort <- data[order(-data$Observation.Value),]

data.sort$toPlot <- data.sort$Observation.Value

# plot the top 20 in a bar chart....

p <- ggplot(data.sort[1:20,],

aes(x = Reference.Area,

y = toPlot)) +

geom_bar(stat="identity") +

ylab("Number of Researchers (FTE) ") +

xlab("") +

ggtitle(paste("Who has the most Researchers?", year)) +

theme_bw() +

theme(axis.text.x = element_text(angle=45, hjust=1))

# normalise to population

# get the data (it's on github)

x <- getURL("https://raw.githubusercontent.com/datasets/population/master/data/population.csv")

pops <- read.csv(text = x)

# extract year = 2011 and countries that we have data from above.

pops <- subset(pops, pops$Year == year)

# join together the two data sets using full_join() function from dplyr

data.Join <- full_join(data, pops, by = c('Reference.Area'='Country.Name'))

# calculate Researchers per 1,000 population

data.Join$res.per.thou <- data.Join$Observation.Value/data.Join$Value*1000

# sort the data

data.Join.sort <- data.Join[order(-data.Join$res.per.thou),]

data.Join.sort$toPlot <- data.Join.sort$res.per.thou

# this %+% command is useful.

# It allows us to force a different data.frame into an existing ggplot graph object

p1 <- p %+% data.Join.sort[1:20,]

# After doing that we need to change the title and the ylab.

p1 + ylab("Researchers (FTE) per 1,000 population ") +

ggtitle(paste("Who has the highest density of Researchers? ", year))

# Explore the proportion of GDP spent on R&D?

x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/grossDomesticExpR&D_GERD_percent_GDP_UNdata_Export_20151019_215514491.csv")

data.GDP <- read.csv(text = x)

# extract the 2011 data

data.GDP <- subset(data.GDP, data.GDP$Time.Period == year)

# rename United Kingdom, US, Rep of Korea and Venezuela

data.GDP$Reference.Area <- gsub("United Kingdom of Great Britain and Northern Ireland", "United Kingdom", data.GDP$Reference.Area)

data.GDP$Reference.Area <- gsub("United States of America", "United States", data.GDP$Reference.Area)

data.GDP$Reference.Area <- gsub("Republic of Korea", "Korea, Rep.", data.GDP$Reference.Area)

data$Reference.Area<- gsub("Venezuela (Bolivarian Republic of)", "Venezuela", data$Reference.Area)

# sort the data

data.GDP.sort <- data.GDP[order(-data.GDP$Observation.Value),]

data.GDP.sort$toPlot <- data.GDP.sort$Observation.Value

# again use %+% command is to force a different data.frameinto an existing ggplot graph object

p2 <- p %+% data.GDP.sort[1:20,]

# and give it the appropriate titles

p2 + ylab("Expenditure on R&D as a Percentage of GDP") +

ggtitle(paste("Relative spending on R&D - ", year))

# how do the two things relate?

data.Join.again <- full_join(data.Join, data.GDP, by = c('Reference.Area'='Reference.Area'))

p3 <- ggplot(data.Join.again,

aes(x = res.per.thou,

y = Observation.Value.y,

label = Reference.Area)) +

geom_point(aes(size = sqrt(Value/pi)), pch = 21, show_guide = FALSE) + # this give the bubbles

ylab("Expenditure on R&D (Percentage of GDP)") +

xlab("Researchers Per 1000 of the population") +

ggtitle(paste("Who is serious about the knowledge Economy? ", year)) +

theme_bw() +

scale_size_continuous(range=c(1,40))

# label selected circles

wanted <- c("Finland", "United States", "United Kingdom", "China",

"Denmark", "Israel", "Japan", "Iceland", "Singapore",

"France", "Russian Federation", "Korea, Rep.", "Norway")

labels <- NULL

for( i in 1:length(wanted)) {

labels<- rbind(labels, data.Join.again[data.Join.again$Reference.Area == wanted[i], ])

}

p3 + geom_text(data=labels,

aes(x = res.per.thou,

y = Observation.Value.y,

label = Reference.Area),

colour = "darkgrey",

size = 5, hjust=1, vjust=-0.5)

# Colour by continent inspired by Gapminder

# Data adapted from http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderCountryColors.txt

x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/countryColors20151026")

countryColors <- read.csv(text = x)

data.Join.again2 <- full_join(data.Join.again, countryColors, by = c('Reference.Area'='country'))

p4 <- ggplot(data.Join.again2,

aes(x = res.per.thou,

y = Observation.Value.y,

label = Reference.Area, )) +

geom_point(aes(size = sqrt(Value/pi)), pch = 21, show_guide = FALSE) +

ylab("Expenditure on R&D (Percentage of GDP)") +

xlab("Researchers Per 1000 of the population") +

ggtitle(paste("Who is serious about the knowledge Economy? ", year)) +

theme_bw() +

scale_size_continuous(range=c(1,40))

p4 <- p4 + aes(fill = continent)

labels <- NULL

for( i in 1:length(wanted)) {

labels<- rbind(labels, data.Join.again2[data.Join.again2$Reference.Area == wanted[i], ])

}

p4 + geom_text(data=labels,

aes(x = res.per.thou,

y = Observation.Value.y,

label = Reference.Area),

colour = "black",

size = 5, hjust=1, vjust=-0.5)

# the facet_wrap is interesting...

p4 + facet_wrap(~ continent)

Showing data from different years, captures different countries:

2012 Data - Lots of data not in the UN database

2010 data - now includes more African countries and India...

Useful link for some of this code: Prof Jennifer Bryan's pages about exploratory data analysis especially this one.

Tuesday, 13 October 2015

Mapping SQL Relay locations using ggplot...

I may have planned to talk about too much in a one hour talk at SQL Relay. In terms of what I planned to show, I wanted to show a little bit of geomapping using ggplot.

Here is the visualisation of the SQL Relay locations drawn with ggplot:

This script illustrates a few interesting things about ggplot:

you can plot polygons using the geom_polygon() function - complex shapes like maps
you can use different data.frames for different 'layers' in a plot
you can sequentially add extra detail to your visualisation - this script uses:

SCRIPT START

# draw a map of the UK with R.

# https://www.students.ncl.ac.uk/keith.newman/r/maps-in-r

# install packages if necessary

library(maps) # Provides functions that let us plot the maps

library(mapdata) # Contains the hi-resolution maps.

library(ggmap) # finds longitude and latitude values

library(ggplot2) # high quality graphing

locs <- c("Nottingham England", "Reading England", "London England", "Bristol England", "Cardiff Wales", "Birmingham England")

locI <- c("N", "R", "L", "Br", "C", "Bi")

locCol <- c("darkgoldenrod1", "yellowgreen", "yellowgreen", "darkgoldenrod1", "blue", "tomato")

lonlat <- geocode(locs)

# draw a simple map of the United Kingdom

map('worldHires',

c('UK'),

xlim=c(-10,2), ylim=c(50,60))

# plot SQL Relay Locations info using base graphics

points(lonlat$lon,

lonlat$lat,

col= locCol,

pch=19,

cex = 1)

Map made with map() and base graphics

# we can also do this in ggplot better...

# make a data.frame

data2 <- as.data.frame(locs)

# bind in longitude, latitude, colours and initials

data2 <- cbind(data2, lonlat, locCol, locI)

# use map_data() function to create a data.frame containing the map of the UK

uk<-map_data("worldHires", region = "UK")

# create the object m with the map using geom_polygon() function

# we can make the map different colours

# see here for more about colours in R

# http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

m <- ggplot() +

geom_polygon(data=uk,

aes(x=long, y=lat, group=group),

fill="gray88") + # more about colours

scale_x_continuous(limits = c(-10,3)) +

scale_y_continuous(limits =c(49,60.9))

# add the points on the plot telling us where the longitude and latitude of various places

# N.B. the data is coming from a different data.frame to the map

# first add some black dots

m <- m + geom_point(data=data2,

aes(x=lon, y=lat),

colour="black",

size = 7)

# then some smaller coloured dots

m <- m + geom_point(data=data2,

aes(x=lon, y=lat),

colour=locCol,

size = 6)

# add the letters inside the circles

m <- m + geom_text(data = data2,

aes(x=lon, y=lat, label = locI, size = 4))

# change the theme to make the map nicer

m <- m + theme_bw()

# add a title

m <- m + ggtitle("SQL Relay Locations") +

theme(legend.position="none") # get rid of the legend

m # show the object

Final map was exported as an A5 PDF which looked nice.

Resources:

https://www.students.ncl.ac.uk/keith.newman/r/maps-in-r

Monday, 12 October 2015

What do I tweet about? Not just R...

Here is another script for my talk entitled First steps in making visualisations using R as part of SQLRelay in Cardiff. It's about using R to make a word cloud or wordle.

Here is my word cloud which shows who and what I tweet about:

What I tweet about: too much bread and not enough work?

I downloaded my twitter archive with help from this link. The resulting .csv file is on Github so that it can be downloaded as part of this script.

Below is the script that I used to make the word cloud. It also shows how the word cloud developed - I removed various words that I didn't think were useful. I did this by using the text mining package - tm. I looked at word frequencies and removed the words I didn't like using the tm.map() function.

SCRIPT START:

# if required install.packages(c("wordcloud", "tm", "ggplot2", "RCurl"))

library("wordcloud")
library("tm")
library("ggplot2")
library("RCurl")

x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/tweets.csv")

twts <- read.csv(text = x)

str(twts)
# it's a data.frame

text <- twts$text # pull out the text...
str(text) # check the structure
# factor - no I dont' think that's quite what we want...

text <- as.character(twts$text)
str(text)
# so that worked and looks better

# check a few of the entries...
text[1]
text[2]
text[10]

# defaults are not great - quite slow and probably too many words to be useful.
# best to add a max.words modifier....
wordcloud(text, max.words = 35)

wordcloud(text,
max.words = 35,
random.order=FALSE,
rot.per=0.35,
use.r.layout=FALSE,
colors=brewer.pal(8, "Dark2"))

Add some colour and get a nice layout...

# layout is random...

# some interesting words here
# my name: brennanpcardiff
# some people I communicate with on twitter: amcunningham, drbillyo, drnostromo
# words like amp, just, the, think, well, day, now, nice, great, new

# good to do some processing....

# convert into a Corpus - a structure for organising text...
text.c <- Corpus(VectorSource(text))
text.c.p <- tm_map(text.c, content_transformer(tolower))
# remove stopwords
text.c.p <- tm_map(text.c.p, removeWords, stopwords("english"))
# get a list of Englisth stopwords....
stopwords(kind = "en")
length(stopwords(kind = "en"))

wordcloud(text.c.p, max.words = 35, colors=brewer.pal(8, "Dark2"))

Not quite what I wanted....

# so some words removed but leaving others... including smiley face.
# remove some Punctuation
text.c.p <- tm_map(text.c.p, removePunctuation)

wordcloud(text.c.p, max.words = 35, colors=brewer.pal(8, "Dark2"))

# better... rstats has appeared - good to see.
# not sure how useful day, the, can, well, like, http are....
# remove more words
text.c.p <- tm_map(text.c.p, removeWords, c("new", "the", "today", "can","just", "day", "amp", "http", "good", "great", "like", "nice", "well", "brennanpcardiff", "will", "now", "httpt", "get", "one"))

set.seed(501)
wordcloud(text.c.p,
max.words = 30,
random.order=FALSE,
rot.per=0.35,
use.r.layout=FALSE,
colors=brewer.pal(8, "Dark2"))

# for more on word frequencies and a graph.......

# create Document Term Matrix
dtm <- DocumentTermMatrix(text.c.p)
inspect(dtm)
dim(dtm)

# Plotting Word Frequencies
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 14)

# make the data frame for ggplot
wf <- data.frame(word = names(freq), freq = freq)
head(wf,20)

wf.freq <- subset(wf, freq > 35) # data frame with abundant words - 23 words

p <- ggplot(wf.freq, aes(word, freq)) +
geom_bar(stat="identity") +
xlab("Frequent words") + # label x-axis
ylab("Frequency") + # label y-axis
ggtitle("Word Frequencies in my tweets") +
theme_bw() +
theme(axis.text.x = element_text(angle=45, hjust=1))

p # show the object...

# make another wordle with the frequencies... gives the same plot...
set.seed(501) # gives a reproducible plot...
wordcloud(names(freq), freq, max.words =30,
random.order=FALSE,
rot.per=0.35,
use.r.layout=FALSE,
colors=brewer.pal(8, "Dark2"))

Helpful resources that I used:

I learned how to this first from here: https://georeferenced.wordpress.com/2013/01/15/rwordcloud/
http://stackoverflow.com/questions/18446408/adding-stopwords-in-r-tm
Introduction to the text mining (tm) package
More articles on R-bloggers about word clouds (including a plea not to use them!)

Friday, 9 October 2015

Exploring diseases in Wales for SQL Relay...

I am giving a talk entitled First steps in making visualisations using R as part of SQLRelay in Cardiff on Wednesday.

I have prepared a script that I am going to use on the day. The script explores some data from the StatsWales website. One data file describes the percentage of people with various illnesses in Wales and another describes lifestyles.

I have downloaded the files as Excel files. These can be directly imported into R but I found it easier to change them in Excel first.

In this case, I changed the data in the following ways:

by combining the two data sets
by removing the gender breakdown
by simplifying the column names
by simplifying the year names for the combined years

This kind of data munging can be done in R as well but I’m quicker doing it in Excel.

I have put the resulting Excel file ontoGithub so that it can be downloaded and imported into R.

As part of this script I make various graphs using R. The final part of the script makes these graphs:

The script I'll be using for my talk is below. I have included copies of some of graphs that are made along the way.

If you are attending my talk in SQL Relay, it's worth getting the script and trying it out on or before the day. You can cut and paste it from below. You can also get it from Github.

SCRIPT START

## This script is designed to illustrate some 'simple graphs'

# Packages required

# install.packages("readxl", "ggplot2", "reshape2", "gridExtra")

library(readxl) # Excel file importer

library(ggplot2) # high quality graph package

library(reshape2) # for melting data frames

library(gridExtra) # for laying out objects on a page

# this is the link to the data

link <- "https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/illnssLifeStyleGenderYeardReNames20151006.xlsx"

# the download.file() function downloads and saves the file with the name given

download.file(url=link, destfile="file.xlsx", mode="wb")

# then we can open the file and extract the data using the read_excel() function.

data<- read_excel("file.xlsx")

# look how it turns up in the Global Environment

# have a look at the data

View(data)

# check the object

str(data) # the structure of the object

# look at the column names

colnames(data)

# the titles

# [1] "year" "highBP" "heartcondexBP" "respIll"

# [5] "mentIll" "arthritis" "diabetes" "currTreated"

# [9] "healthFairRpoor" "smoker" "alcoholConsAboveGuide" "alcoholConsBinge"

# [13] "fruitVegGuide" "activeon5" "activeZero" "overwgtObese"

# [17] "obese"

## Using the basic plot() function

# refer to the data within the data.frame

plot(data$year, data$mentIll)

plot(data$year, data$arthritis)

plot(data$year, data$smoker)

## improve one of the plots a little

plot(data$year, data$mentIll,

xlab = "Year",

ylab = "Mental Illness (% of Welsh Population)",

pch = 15)

## pair() function useful to plot scatterplot matrices

pairs(data[,1:7]) # shows the rate of change over 10 years

Useful visualisation using pairs() function

## I recommend the ggplot2 library

p <- ggplot(data, # a data.frame with the data

aes(x=year, y=mentIll)) + # columns of the data.frame

geom_point(colour = "red", size = 3) # type of plot

# this creates an object called p

p # show the object

# modify the object to add a title and change the theme

p <- p + ylab("Mental Illness \n (% Welsh Population)") +

xlab(" ") +

theme_bw()

p # show the object again

# the options of how to customise this graph very varied

# there is some recognised good style but also a lot of preference

p <- p + theme(axis.title.y = element_text(size = 18 )) +

theme(axis.text = element_text(size = 20))

p # show the object again

## Let's make a bar chart

ggplot(data, aes(x=year, y=mentIll)) +

geom_bar(stat="identity") +

ylab("Mental Illness \n (% Welsh Population)")

# check out the bottom axis

# Looks very strange with the "2007.5"

# Why?

str(data$year)

# num [1:11] 2004 2005 2006 2007 2008 ...

mode(data$year)

# [1] "numeric"

# Because R has imported year as a number, adding 0.5 is allowed

# change to a character

data$year <- as.character(data$year)

ggplot(data, aes(x=year, y=mentIll)) + # data.frame & the data

geom_bar(stat="identity") + # different type of plot

ylab("Mental Illness \n (% Welsh Population)")

A better graph with all the numbers included

## Look at a possible correlation

ggplot(data,

aes(x=diabetes, y=obese, label = year)) +

geom_point(colour = "red", size = 5)

ggplot(data,

aes(x=diabetes, y=obese, label = year)) +

geom_point(colour = "red", size = 5) +

geom_text(size = 5, hjust=1, vjust=-0.5)

p <- ggplot(data,

aes(x=diabetes, y=obese, label = year)) +

geom_point(colour = "red", size = 5) +

geom_text(size = 5, hjust=1, vjust=-0.5) +

ylab("Obesity (% of Welsh Pop") +

xlab("Diabetes (% of Welsh Pop)") +

theme_bw()

p <- p + theme(axis.title.y = element_text(size = 18 )) +

theme(axis.title.x = element_text(size = 18 )) +

theme(axis.text = element_text(size = 20))

# introducing facet

# draw six graphs of diseases over time...

# take a subset of data

data.sub <- data[1:7]

# rename the columns

colnames(data.sub) <- c("Year", "High BP", "Heart Conditions", "Respiratory Illness", "Mental Illness", "Arthritis", "Diabetes")

# melt the data from wide to long:

melted.data.sub <- melt(data.sub, id.var = "Year")

colnames(melted.data.sub) <- c("Year", "Illness", "Percent")

# make a plot

x <- ggplot(data = melted.data.sub,

aes(x=Year, y=Percent)) +

geom_point(aes(colour = Illness, shape= Illness, size = 5)) +

theme_bw() +

theme(legend.position = "none") +

ggtitle("Disease Burden in Wales over 10 years")

x # look at the plot

Not very useful!

# break into facets

x <- x + facet_wrap(~ Illness, scales = "free", ncol=2) +

ylab("Percent of Welsh Pop")

x # show the plot

Nice set of graphs, I think.

# save the plot

x + ggsave("DiseaseBurdenWales10years.pdf")

## Extra info on arranging plots on a single page

# make a new graph in the object g1

g1 <- ggplot(data, # a data.frame with the data

aes(x=year, y=obese)) + # columns of the data.frame

geom_point(colour = "black", size = 5) + # type of plot

ylab("Obesity (% of Welsh Pop") +

xlab("") +

theme_bw()

g1 <- g1 + theme(axis.title.y = element_text(size = 18 )) +

theme(axis.title.x = element_text(size = 16 )) +

theme(axis.text = element_text(size = 20))

# make another graph in the object g2

g2 <- ggplot(data,

aes(x=year, y=diabetes)) +

geom_point(colour = "blue", size = 5) +

ylab("Diabetes (% of Welsh Pop") +

xlab("") +

theme_bw()

g2 <- g2 + theme(axis.title.y = element_text(size = 18 )) +

theme(axis.title.x = element_text(size = 16 )) +

theme(axis.text = element_text(size = 20))

# using a function from the gridExtra package...

# put the three graphs on the same graphical output.

grid.arrange(g1, g2 ,p)

# makes the plot at the top of the blog

SCRIPT END

Useful resources:

ggplot home page and documentation - VERY USEFUL with lots of examples.
R Cookbook - useful site about ggplot
R-Bloggers - lots of interesting blogs about ggplot
Quick-R - very readable site - not just ggplot lots about base graphs too.

A protein assay	An enzyme kinetic plot	Visual Index	Study at Cardiff

R for Biochemists

A site to help Biochemists learn R.

How to use this site...

Starting points

Monday, 26 October 2015

Exploring UN data on researchers...

Tuesday, 13 October 2015

Mapping SQL Relay locations using ggplot...

Monday, 12 October 2015

What do I tweet about? Not just R...

Friday, 9 October 2015

Exploring diseases in Wales for SQL Relay...

Useful resources: