R for Biochemists: 2015

Wednesday, 16 December 2015

Counting cell nuclei in an image

I have been working with a colleague, Dr Joaquin de Navascues from the European Cancer Stem Cell Research Institute, to develop a workshop entitled an "Introduction to Biological Image Analysis". We plan to discuss the fundamentals of digital images and how to work with them in FIJI (Image J) and R, two open source data analysis tools. We aim to deliver this workshop during March or April next year at Cardiff University.

Joaquin is taking the lead on working with FIJI and I am developing the R material. For biological image analysis there is a useful package called EBImage. This has a nice introduction available and a detailed handbook - (version: 4.13.5).

I have been using this package to count cell nuclei in an image. The image is a microscope picture of a Drosophila gut. Counting nuclei involves mathematical transformations of the digital data. A digital image is a matrix of numbers.

In non-technical language the key steps are:

blur the image
apply a threshold to turn nuclei into 'blobs'
count the 'blobs'

The output from this script is:

Number of nuclei in this image = 92

The script below downloads an image from Github, opens the image, displays it, transforms it and then counts the nuclei. Because I plan to count nuclei from more than one image, I have made a function and then applied it to the downloaded file. Using user defined functions to automate your workflow is a very good use of R.

SCRIPT:

# to install use this:
# source("http://bioconductor.org/biocLite.R")
# biocLite("EBImage")

library(EBImage) # you might need to install - see above

# the image is on Github
# it is from a set of cells that are stained to detect the nuclei

# this is the link to the data

link <- "https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/seq/seq_z015_c003.tif"

# the download.file() function downloads and saves the file

download.file(url=link, destfile="file.tif", mode="wb")

# EBImage uses the readImage() function to load the file.

img1 <- readImage("file.tif")

display(img1, method = "raster") # shows the image within R.

display(img1*4, method = "raster") # multiply the image to make brighter

# I have written a function to count nuclei

# includes blurring the image, applying a threshold and counting....

# it displays the image as it is changed.

# it's not perfect and overestimates the number of nuclei.

# it's an example that can be done.

# improving and customizing the various options is very feasible.

countNuclei <- function(img1){

# blur the image

w = makeBrush(size = 11, shape = 'gaussian', sigma = 5) # makes the blurring brush

img_flo = filter2(img1*2, w) # apply the blurring filter

display(img_flo * 4, method = "raster") # display the blurred image - brighter for display only.

# apply a threshold

nmaskt = thresh(img_flo *2, w=10, h=10, offset=0.05)

display(nmaskt, method = "raster")

# the bwlabel() function 'counts' the blobs

nucNo <- max(bwlabel(nmaskt))

# this outputs the count to us

cat('Number of nuclei in this image =', max(bwlabel(nmaskt)),'\n')

return(nucNo)

}

# this applies the function to the image

nucNo <- countNuclei(img1)

END OF SCRIPT

Tuesday, 10 November 2015

Drawing a protein domain structure using R....

As a biochemist, I like a protein structure. Today, I have written a script to draw a protein structure for one of my favourite proteins, the NF-kappaB subunit, p65 - also known as Rel A. I've been measuring NF-kappaB since I published my first paper from my PhD in Trinity College Dublin (many years ago) and one of my most cited papers from Cardiff University is one of the first measurement of Rel A in a primary cells from a human cancer - chronic lymphocytic leukaemia.

The diagram is created using information from the Uniprot webpage for p65. It shows the domain structure for p65. Here is the diagram:

For those interested, RHD stands for Rel Homology Domain and TAD stands for Transactivation Domain. For more information, here's a link to the Wikipedia page.

Here is the script that draws the diagram:

SCRIPT START

# draw the NF-kappaB subunit, Rel A (p65) with R.

# draw it as a series of rectanges

# going from left to right.

# using the Uniprot webpage for the informaton

# http://www.uniprot.org/uniprot/Q04206

# cut and paste from the XML page to create two objects

# first object is a list containing accession number and names

## Step 1: a list containing names and details

# list of names...

names <- list(

accession = c("Q04206"),

name = "TF65_HUMAN",

protein.recommendedName.fullName = "Transcription factor p65",

protein.recommendedName.alternativeName = "Nuclear factor NF-kappa-B p65 subunit",

protein.fullName = "Nuclear factor of kappa light polypeptide gene enhancer in B-cells 3",

gene.name.primary = "RELA",

gene.name.synonym = "NFKB3",

organism.name.scientific = "Homo sapiens"

)

## step 2: create the data frame with all the information.

# cut and past from the XML page to make vectors

# features to plot

types <- c("chain", "domain", "region of interest", "short sequence motif", "short sequence motif")

description <- c("Transcription factor p65", "RHD", "Activation domain","Nuclear localization signal", "9aaTAD")

begin <- c(1,19, 415, 301, 536)

end <- c(551, 306, 459, 304, 544)

col <- c("white", "blue", "red", "black", "orange")

# assemble vectors into a data frame

features <- data.frame(types, description, begin, end, col)

# check the structure of the data frame

str(features)

# shows description and col to be factors - this will cause problems later...

# so change them now

features$description <- as.character(features$description)

features$col <- as.character(features$col)

# it will be better if we sort features in order of where they begin

features <- features[order(features$begin),]

## step 3: draw the diagram

screen.width <- max(features$end)

screen.height <- 25 # this is a bit arbitary

plot(c(-10, screen.width),

c(0, screen.height),

type= "n",

xlab = "Number of amino acids",

ylab = "", yaxt='n') # suppress the y label and y axis

# make the rectangles in a loop

for (i in 1:length(features$types) ) {

rect(xleft = features$begin[i],

ytop = screen.height/2 + 2.5,

ybottom = screen.height/2 - 2.5,

xright = features$end[i],

col = features$col[i])

}

# add text to the top of the illustration with the recommended name

text(max(features$end)/2, screen.height-2.5, names$protein.recommendedName.fullName, cex=1.5)

# and the alternative name

text(max(features$end)/2, screen.height-5, names$protein.recommendedName.alternativeName, cex=1)

# add the descriptions of the features

pos.text.x <- features$begin[2:5] + (features$end[2:5] - features$begin[2:5])/2

pos.text.y <- c(screen.height/2 + 3.5, screen.height/2 - 3.5)

text(pos.text.x, pos.text.y, features$description[2:5], cex=1, col=features$col[2:5])

# add the accession number to the bottom smaller text and the source of the data

text(max(features$end)/2, 5 , paste("Uniprot Accession Number:", names$accession), cex=0.8)

text(max(features$end)/2, 3 , "Souce of data: http://www.uniprot.org/uniprot/Q04206", cex=0.8)

Monday, 26 October 2015

Exploring UN data on researchers...

Last Tuesday was World Statistics Day 2015 as declared by the United Nations. I took the opportunity to play with some of the data about the number of researchers around the globe and made a visualisation about them. The visualisation looks at the possible relationship between the number of researchers per 1000 population and the amount of money spent as a proportion of GDP. The style is inspired by Gapminder and the page by Professor Jennifer Bryan where the countries are sized by population and coloured by continent.

Here is the visualisation:

To make this visualisation, I had to download data from the UN website about the number of researchers. The screen grab is here:

and here:

My understanding of the Terms and Conditions suggests that I'm allowed to put the data I downloaded onto Github which I have done to allow it to be downloaded within R. The filename has the date and time of download so that I know when I did it.

Here is the script I wrote while playing with the data and some graphs I made along the way.

The script illustrates one interesting feature of ggplot which is the ability to force a different data frame into an existing plotting object using the %+% command. I heard about %+% first from Steph Locke who runs the Cardiff R User Group and then from this Stack Overload question.

Here is the script:

SCRIPT

# install these if necessary

library(dplyr)

library(ggplot2)

library(RCurl)

# who has the most researchers in the world?

# download the data

x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/numberOfResearchersFTEUNdata_Export_20151019_214919785.csv")

data <- read.csv(text = x)

# extract the data for a particular year - this is a decision - 2011 in this case

# lose some countries as they don't have 2011 data e.g. Australia and India

# option exists to change the year and run again...

year <- 2011

data <- subset(data, data$Time.Period == year)

# rename United Kingdom, US, Rep of Korea and Venezuela

data$Reference.Area <- gsub("United Kingdom of Great Britain and Northern Ireland", "United Kingdom", data$Reference.Area)

data$Reference.Area <- gsub("United States of America", "United States", data$Reference.Area)

data$Reference.Area <- gsub("Republic of Korea", "Korea, Rep.", data$Reference.Area)

data$Reference.Area<- gsub("Venezuela (Bolivarian Republic of)", "Venezuela", data$Reference.Area)

# sort the data

data.sort <- data[order(-data$Observation.Value),]

data.sort$toPlot <- data.sort$Observation.Value

# plot the top 20 in a bar chart....

p <- ggplot(data.sort[1:20,],

aes(x = Reference.Area,

y = toPlot)) +

geom_bar(stat="identity") +

ylab("Number of Researchers (FTE) ") +

xlab("") +

ggtitle(paste("Who has the most Researchers?", year)) +

theme_bw() +

theme(axis.text.x = element_text(angle=45, hjust=1))

# normalise to population

# get the data (it's on github)

x <- getURL("https://raw.githubusercontent.com/datasets/population/master/data/population.csv")

pops <- read.csv(text = x)

# extract year = 2011 and countries that we have data from above.

pops <- subset(pops, pops$Year == year)

# join together the two data sets using full_join() function from dplyr

data.Join <- full_join(data, pops, by = c('Reference.Area'='Country.Name'))

# calculate Researchers per 1,000 population

data.Join$res.per.thou <- data.Join$Observation.Value/data.Join$Value*1000

# sort the data

data.Join.sort <- data.Join[order(-data.Join$res.per.thou),]

data.Join.sort$toPlot <- data.Join.sort$res.per.thou

# this %+% command is useful.

# It allows us to force a different data.frame into an existing ggplot graph object

p1 <- p %+% data.Join.sort[1:20,]

# After doing that we need to change the title and the ylab.

p1 + ylab("Researchers (FTE) per 1,000 population ") +

ggtitle(paste("Who has the highest density of Researchers? ", year))

# Explore the proportion of GDP spent on R&D?

x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/grossDomesticExpR&D_GERD_percent_GDP_UNdata_Export_20151019_215514491.csv")

data.GDP <- read.csv(text = x)

# extract the 2011 data

data.GDP <- subset(data.GDP, data.GDP$Time.Period == year)

# rename United Kingdom, US, Rep of Korea and Venezuela

data.GDP$Reference.Area <- gsub("United Kingdom of Great Britain and Northern Ireland", "United Kingdom", data.GDP$Reference.Area)

data.GDP$Reference.Area <- gsub("United States of America", "United States", data.GDP$Reference.Area)

data.GDP$Reference.Area <- gsub("Republic of Korea", "Korea, Rep.", data.GDP$Reference.Area)

data$Reference.Area<- gsub("Venezuela (Bolivarian Republic of)", "Venezuela", data$Reference.Area)

# sort the data

data.GDP.sort <- data.GDP[order(-data.GDP$Observation.Value),]

data.GDP.sort$toPlot <- data.GDP.sort$Observation.Value

# again use %+% command is to force a different data.frameinto an existing ggplot graph object

p2 <- p %+% data.GDP.sort[1:20,]

# and give it the appropriate titles

p2 + ylab("Expenditure on R&D as a Percentage of GDP") +

ggtitle(paste("Relative spending on R&D - ", year))

# how do the two things relate?

data.Join.again <- full_join(data.Join, data.GDP, by = c('Reference.Area'='Reference.Area'))

p3 <- ggplot(data.Join.again,

aes(x = res.per.thou,

y = Observation.Value.y,

label = Reference.Area)) +

geom_point(aes(size = sqrt(Value/pi)), pch = 21, show_guide = FALSE) + # this give the bubbles

ylab("Expenditure on R&D (Percentage of GDP)") +

xlab("Researchers Per 1000 of the population") +

ggtitle(paste("Who is serious about the knowledge Economy? ", year)) +

theme_bw() +

scale_size_continuous(range=c(1,40))

# label selected circles

wanted <- c("Finland", "United States", "United Kingdom", "China",

"Denmark", "Israel", "Japan", "Iceland", "Singapore",

"France", "Russian Federation", "Korea, Rep.", "Norway")

labels <- NULL

for( i in 1:length(wanted)) {

labels<- rbind(labels, data.Join.again[data.Join.again$Reference.Area == wanted[i], ])

}

p3 + geom_text(data=labels,

aes(x = res.per.thou,

y = Observation.Value.y,

label = Reference.Area),

colour = "darkgrey",

size = 5, hjust=1, vjust=-0.5)

# Colour by continent inspired by Gapminder

# Data adapted from http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderCountryColors.txt

x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/countryColors20151026")

countryColors <- read.csv(text = x)

data.Join.again2 <- full_join(data.Join.again, countryColors, by = c('Reference.Area'='country'))

p4 <- ggplot(data.Join.again2,

aes(x = res.per.thou,

y = Observation.Value.y,

label = Reference.Area, )) +

geom_point(aes(size = sqrt(Value/pi)), pch = 21, show_guide = FALSE) +

ylab("Expenditure on R&D (Percentage of GDP)") +

xlab("Researchers Per 1000 of the population") +

ggtitle(paste("Who is serious about the knowledge Economy? ", year)) +

theme_bw() +

scale_size_continuous(range=c(1,40))

p4 <- p4 + aes(fill = continent)

labels <- NULL

for( i in 1:length(wanted)) {

labels<- rbind(labels, data.Join.again2[data.Join.again2$Reference.Area == wanted[i], ])

}

p4 + geom_text(data=labels,

aes(x = res.per.thou,

y = Observation.Value.y,

label = Reference.Area),

colour = "black",

size = 5, hjust=1, vjust=-0.5)

# the facet_wrap is interesting...

p4 + facet_wrap(~ continent)

Showing data from different years, captures different countries:

2012 Data - Lots of data not in the UN database

2010 data - now includes more African countries and India...

Useful link for some of this code: Prof Jennifer Bryan's pages about exploratory data analysis especially this one.

Tuesday, 13 October 2015

Mapping SQL Relay locations using ggplot...

I may have planned to talk about too much in a one hour talk at SQL Relay. In terms of what I planned to show, I wanted to show a little bit of geomapping using ggplot.

Here is the visualisation of the SQL Relay locations drawn with ggplot:

This script illustrates a few interesting things about ggplot:

you can plot polygons using the geom_polygon() function - complex shapes like maps
you can use different data.frames for different 'layers' in a plot
you can sequentially add extra detail to your visualisation - this script uses:

SCRIPT START

# draw a map of the UK with R.

# https://www.students.ncl.ac.uk/keith.newman/r/maps-in-r

# install packages if necessary

library(maps) # Provides functions that let us plot the maps

library(mapdata) # Contains the hi-resolution maps.

library(ggmap) # finds longitude and latitude values

library(ggplot2) # high quality graphing

locs <- c("Nottingham England", "Reading England", "London England", "Bristol England", "Cardiff Wales", "Birmingham England")

locI <- c("N", "R", "L", "Br", "C", "Bi")

locCol <- c("darkgoldenrod1", "yellowgreen", "yellowgreen", "darkgoldenrod1", "blue", "tomato")

lonlat <- geocode(locs)

# draw a simple map of the United Kingdom

map('worldHires',

c('UK'),

xlim=c(-10,2), ylim=c(50,60))

# plot SQL Relay Locations info using base graphics

points(lonlat$lon,

lonlat$lat,

col= locCol,

pch=19,

cex = 1)

Map made with map() and base graphics

# we can also do this in ggplot better...

# make a data.frame

data2 <- as.data.frame(locs)

# bind in longitude, latitude, colours and initials

data2 <- cbind(data2, lonlat, locCol, locI)

# use map_data() function to create a data.frame containing the map of the UK

uk<-map_data("worldHires", region = "UK")

# create the object m with the map using geom_polygon() function

# we can make the map different colours

# see here for more about colours in R

# http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

m <- ggplot() +

geom_polygon(data=uk,

aes(x=long, y=lat, group=group),

fill="gray88") + # more about colours

scale_x_continuous(limits = c(-10,3)) +

scale_y_continuous(limits =c(49,60.9))

# add the points on the plot telling us where the longitude and latitude of various places

# N.B. the data is coming from a different data.frame to the map

# first add some black dots

m <- m + geom_point(data=data2,

aes(x=lon, y=lat),

colour="black",

size = 7)

# then some smaller coloured dots

m <- m + geom_point(data=data2,

aes(x=lon, y=lat),

colour=locCol,

size = 6)

# add the letters inside the circles

m <- m + geom_text(data = data2,

aes(x=lon, y=lat, label = locI, size = 4))

# change the theme to make the map nicer

m <- m + theme_bw()

# add a title

m <- m + ggtitle("SQL Relay Locations") +

theme(legend.position="none") # get rid of the legend

m # show the object

Final map was exported as an A5 PDF which looked nice.

Resources:

https://www.students.ncl.ac.uk/keith.newman/r/maps-in-r

Monday, 12 October 2015

What do I tweet about? Not just R...

Here is another script for my talk entitled First steps in making visualisations using R as part of SQLRelay in Cardiff. It's about using R to make a word cloud or wordle.

Here is my word cloud which shows who and what I tweet about:

What I tweet about: too much bread and not enough work?

I downloaded my twitter archive with help from this link. The resulting .csv file is on Github so that it can be downloaded as part of this script.

Below is the script that I used to make the word cloud. It also shows how the word cloud developed - I removed various words that I didn't think were useful. I did this by using the text mining package - tm. I looked at word frequencies and removed the words I didn't like using the tm.map() function.

SCRIPT START:

# if required install.packages(c("wordcloud", "tm", "ggplot2", "RCurl"))

library("wordcloud")
library("tm")
library("ggplot2")
library("RCurl")

x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/tweets.csv")

twts <- read.csv(text = x)

str(twts)
# it's a data.frame

text <- twts$text # pull out the text...
str(text) # check the structure
# factor - no I dont' think that's quite what we want...

text <- as.character(twts$text)
str(text)
# so that worked and looks better

# check a few of the entries...
text[1]
text[2]
text[10]

# defaults are not great - quite slow and probably too many words to be useful.
# best to add a max.words modifier....
wordcloud(text, max.words = 35)

wordcloud(text,
max.words = 35,
random.order=FALSE,
rot.per=0.35,
use.r.layout=FALSE,
colors=brewer.pal(8, "Dark2"))

Add some colour and get a nice layout...

# layout is random...

# some interesting words here
# my name: brennanpcardiff
# some people I communicate with on twitter: amcunningham, drbillyo, drnostromo
# words like amp, just, the, think, well, day, now, nice, great, new

# good to do some processing....

# convert into a Corpus - a structure for organising text...
text.c <- Corpus(VectorSource(text))
text.c.p <- tm_map(text.c, content_transformer(tolower))
# remove stopwords
text.c.p <- tm_map(text.c.p, removeWords, stopwords("english"))
# get a list of Englisth stopwords....
stopwords(kind = "en")
length(stopwords(kind = "en"))

wordcloud(text.c.p, max.words = 35, colors=brewer.pal(8, "Dark2"))

Not quite what I wanted....

# so some words removed but leaving others... including smiley face.
# remove some Punctuation
text.c.p <- tm_map(text.c.p, removePunctuation)

wordcloud(text.c.p, max.words = 35, colors=brewer.pal(8, "Dark2"))

# better... rstats has appeared - good to see.
# not sure how useful day, the, can, well, like, http are....
# remove more words
text.c.p <- tm_map(text.c.p, removeWords, c("new", "the", "today", "can","just", "day", "amp", "http", "good", "great", "like", "nice", "well", "brennanpcardiff", "will", "now", "httpt", "get", "one"))

set.seed(501)
wordcloud(text.c.p,
max.words = 30,
random.order=FALSE,
rot.per=0.35,
use.r.layout=FALSE,
colors=brewer.pal(8, "Dark2"))

# for more on word frequencies and a graph.......

# create Document Term Matrix
dtm <- DocumentTermMatrix(text.c.p)
inspect(dtm)
dim(dtm)

# Plotting Word Frequencies
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 14)

# make the data frame for ggplot
wf <- data.frame(word = names(freq), freq = freq)
head(wf,20)

wf.freq <- subset(wf, freq > 35) # data frame with abundant words - 23 words

p <- ggplot(wf.freq, aes(word, freq)) +
geom_bar(stat="identity") +
xlab("Frequent words") + # label x-axis
ylab("Frequency") + # label y-axis
ggtitle("Word Frequencies in my tweets") +
theme_bw() +
theme(axis.text.x = element_text(angle=45, hjust=1))

p # show the object...

# make another wordle with the frequencies... gives the same plot...
set.seed(501) # gives a reproducible plot...
wordcloud(names(freq), freq, max.words =30,
random.order=FALSE,
rot.per=0.35,
use.r.layout=FALSE,
colors=brewer.pal(8, "Dark2"))

Helpful resources that I used:

I learned how to this first from here: https://georeferenced.wordpress.com/2013/01/15/rwordcloud/
http://stackoverflow.com/questions/18446408/adding-stopwords-in-r-tm
Introduction to the text mining (tm) package
More articles on R-bloggers about word clouds (including a plea not to use them!)

A protein assay	An enzyme kinetic plot	Visual Index	Study at Cardiff

R for Biochemists

A site to help Biochemists learn R.

How to use this site...

Starting points

Wednesday, 16 December 2015

Counting cell nuclei in an image

Tuesday, 10 November 2015

Drawing a protein domain structure using R....

Monday, 26 October 2015

Exploring UN data on researchers...

Tuesday, 13 October 2015

Mapping SQL Relay locations using ggplot...

Monday, 12 October 2015

What do I tweet about? Not just R...