R for Biochemists: March 2017

I am working towards creating a Men's Health Data Visualisation for the Battle of the Beards, an IT conference to be held in Cardiff at the end of March.

I was inspired by Professor David Spiegelhalter from the University of Cambridge. I heard him talk at Cardiff University. He has a blog about Uncertainty and this graph is inspired one of his blog posts. I've been reading his book entitled The Norm Chronicles (written with Michael Blastland) - all about risk, life and death.

My starting point for my exploration of Men's Health is a graph that can calculate my chance of dying this year. As my source, I'm using the UK Life Tables which have calculated the chance of dying between birthdays.

I became a bit more worried about dying as I got older and particularly when I became a father. My chance of dying at 47 is 1 in 390. However, I wanted to develop habits that would reduce this chance a little. More about that next time...

START

# I am preparing a session for the Battle of the Beards
# It's an IT conference with difference
# Fight male suicide at “The Battle of the Beards”
# Talking tech & making a difference
# Cardiff | 29 March 2017
# http://battleofthebeards.info/

library(readxl)
library(ggplot2)

# my preparation involves exploring the UK National Life Tables
# these are available through the Office of National Statistics
# as an excel file from this page:
# https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/lifeexpectancies/datasets/nationallifetablesunitedkingdomreferencetables
# the can be downloaded from here:
ons_url <- c("https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/lifeexpectancies/datasets/nationallifetablesunitedkingdomreferencetables/current/nltuk1315reg.xls")
# to guarantee undistrubed access, I've put the file on github
github_url <- "https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/nltuk1315reg.xls"

# the download.file() function downloads and saves the file with the name given
download.file(url=ons_url,destfile="file.xls", mode="wb")
# if this doesn't work try replacing the url with the github_url
# then we can open the file and extract the data using the read_excel() function.
data<- read_excel("file.xls", sheet = 5, skip = 6)

# need to remove four bottom rows - these are all blank.
data <- data[1:101,]

colnames(data)

# x = age in years
# first set is males and second set is females
# mx is the central rate of mortality, defined as the number of deaths at age x last birthday in the three year period to which the National Life Table
# relates divided by the average population at that age over the same period.

# qx is the mortality rate between age x and (x +1), that is the probability that a person aged x exact will die before reaching age (x +1).

# lx is the number of survivors to exact age x of 100,000 live births of the same sex who are assumed to be subject throughout their lives to the
# mortality rates experienced in the three year period to which the National Life Table relates.

# dx is the number dying between exact age x and (x +1) described similarly to lx, that is dx=lx-lx+1.

# ex is the average period expectation of life at exact age x, that is the average number of years that those aged x exact will live thereafter
# based on the mortality rates experienced in the three year period to which the National Life Table relates.

# to allow us to separate the male and female data, I am relabelling the column names

colnames(data) <- c("age",
"m.mx", "m.qx", "m.lx", "m.dx", "m.ex",
"f.mx", "f.qx", "f.lx", "f.dx", "f.ex")

# the life table allow us to draw some interesting curves
# the first one I want to explore is my chance of dying this year.

# qx is the key value.
# the chance of my dying before my next birthday.

# here is the graph in ggplot
ggplot(data = data,
aes(x = age, y = m.qx)) +
geom_point()

# The log 10 plots give an interesting pattern.
# first shown to me by Professor David Spiegelhalter
# Here it is in my blog https://understandinguncertainty.org/what-your-effective-age
ggplot(data = data,
aes(x = age, y = log10(m.qx))) +
geom_point()

# Another way to change the scale on the y-axis using scale_y_log10()
ggplot(data = data,
aes(x = age, y = m.qx)) +
geom_point() +
scale_y_log10()

# let's make this look a bit nicer with lines, titles and a theme...
# make the object p
p <- ggplot(data = data,
aes(x = age, y = m.qx)) +
geom_line(colour = "red") +
geom_point(colour = "red", size = 0.75) +
ylab("Chance of dying") +
xlab("Current Age") +
ggtitle("Chance of Dying before next Birthday") +
scale_y_log10() +
theme_bw()

p # show the object

# label on y axis is not the most useful, I think.
# good learning point here on customising the y-axis label.
# added as arguments in the scale_y_log10()

p <- ggplot(data = data,
aes(x = age, y = m.qx)) +
geom_line(colour = "red") +
geom_point(colour = "red", size = 0.75) +
ylab("Chance of dying") +
xlab("Current Age") +
ggtitle("Chance of Men Dying before next Birthday") +
scale_y_log10(
breaks = c(0.0001, 0.001, 0.01, 0.1, 0.5), # where to put labels
labels = c("1/10,000", "1/1000", "1/100", "1/10", "1/2")) + # the labels
theme_bw()

p # show the object

# add source of the data
source <- paste("Source: Office of National Statistics\n filename: nltuk1315reg.xls\n accessed:", Sys.Date())
p + annotate("text", x = 70, y = 0.0001, label=source, size=3)

# more info available from the Office of National Statistics here:
# https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/lifeexpectancies

The magrittr package allows us to write code using 'pipes'. The code is a little easier to read. Easier code to read is easier to share, document and use which appeals to me. I think it makes it more open and I'm a big fan of open science.

Using pipes avoids us piling up our functions on single lines and encourages me to layout my code in a more organised way.
Pipes allows space for documentation, explanations and comments.
I have used pipes to create this cluster diagram to illustrate the point:

Consider adding the use of pipes to your code to make it easier for others.

Here's is the script:

# START
# download the data from github
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/microArrayData.tsv")
data <- read.table(text = x, header = TRUE, sep = "\t")

# when we have a workflow that we like there is a tendency to pile up our functions

# here is an example:
plot(hclust(dist(t(data[2:15]))))

# the introduction of the magrittr piping function into R..x
# allows us to do this in a way that make a work flow easier to view and easier to comment

# install.packages("magrittr") # if required
library(magrittr)
# https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html

data[2:15] %>% # subset the object (columns 2:15 of the dataframe)
t() %>% # transform it so that columns are rows
dist() %>% # calculate distance
hclust() %>% # do a hierarchical cluster
plot() # then plot the result.

# this approach really makes life easier when we have arguments in our functions
# we can change the method for the dist() function
# and we can add a title and some colour with the plot() function
# using pipes, this looks like this:

data[2:15] %>% # create a subset of object data (cols 2:15)
t() %>% # transform it so that columns are rows
dist(method = "manhattan") %>% # calc dist with manhattan meth
hclust() %>% # do a hierarchial cluster
plot(main="Cluster Diagram of Drug Treatments\n(2 Mar 2017)", # add a title
lwd=2, col="blue", cex = 1.1) # thick line & change colours

# this code the other way looks like this:
plot(hclust(dist(t(data[2:15]), method = "manhattan")),
main="Cluster Diagram of Drug Treatments\n(2 Mar 2017)", # add a title
lwd=2, col="blue", cex = 1.1)

# it's a little bit difficult to separate the functions, objects and arguments in my opinion

# N.B. three key points to remember about using pipes:
# (1) brackets remain to allow us to identify functions()
# (2) the arguments go within the brackets - not the objects
# (3) objects are now 'piped' into the functions using %>%
# END

There are lots of blog posts about pipes and magrittr. Just search....

Hat tip to Steph Locke and Dave from the Cardiff R User group for encouraging me to use pipes.

A protein assay	An enzyme kinetic plot	Visual Index	Study at Cardiff

R for Biochemists

A site to help Biochemists learn R.

How to use this site...

Starting points

Monday 6 March 2017

Exploring the chance of dying using the UK Life Tables

Thursday 2 March 2017

Using 'pipes' in R for easier reading code