Monday 20 August 2018

Exploring more immunization data...

Last week, inspired by Factfulness, I made a graph showing the BCG immunization coverage for children at 1 year. The Factfulness graph didn't mention any specific immunization programme and World Health Organisation data monitors immunization coverage for other vaccines. For that reason, I thought it would be good to download and graph more of the available global immunization data.

This allowed me to generate this graph which shows immunization coverage across the world for nine different vaccines. The various dates that monitoring starts shows that new immunization programs are being rolled out on a regular basis - good to see.




Below is the code for downloading the data and making the graph...
If you would rather just make the graph with some cleaner data, the data from Aug 20, 2018 is available on github and can be downloaded using the read_csv() code shown about half way down the script.

## START
##  download the data  
library(tidyverse)
# install.packages("WHO")
library(WHO)

# check out the codes of the WHO data...
codes <- get_codes()
# get codes for immunizations
immun_codes <- codes[grepl("[Ii]mmuniz", codes$display), ]
immun_codes$label

# go through each of the 18 to find global data...
# Number 1 has global data

# download number 1
# requires internet access
immun_data <- as.tibble(get_data(immun_codes$label[1]))

# filter for global data
immun_data <- filter(immun_data, region == "(WHO) Global")

# repeat download for next 2 to 18 WHO codes  
# had to do this as a loop as couldn't get it to work using lapply...
# start off with second value as first is above..
# requires internet access and patience...
for(i in 2:length(immun_codes$label)){
    # download the data
    data <- as.tibble(get_data(immun_codes$label[i])) 
    
    #tell you that it has downloaded...
    print(paste("Dataset",immun_codes$label[i], "downloaded."))
    
    # filter the data for Global values
    data <- filter(data, region == "(WHO) Global")
    
    # if there is some data bind_rows()
    if(nrow(data)>1){
        # bind_rows() function from dplyr
        immun_data <- bind_rows(immun_data, data)
    }else{   
        # if not just tell us....
        print("No global data in this set")
    }
}





# reduce columns using select() function  
immun_data <- select(immun_data, gho, region, year, value )

# to avoid having to download every time... save a local copy
file_name <- paste0("global_immun_data", Sys.Date())
write_csv(immun_data, file_name)

## ----read_back if you have saved to continue from here
# immun_data <- read_csv(file_name)

# read in data from github using read_csv() function

# immun_data <- read_csv("https://raw.githubusercontent.com/brennanpincardiff/RforBiochemists/master/data/global_immun_data2018-08-20")


# Let's make our plot...
plot <- ggplot(immun_data, aes(x = year, y = value, 
    colour = gho)) +
  geom_line(size = 1)+ 
  theme(legend.position="none")

plot

# separate the plots with facet wrap
plotf <- plot + facet_wrap(~gho)
plotf


## The individual graph titles are difficult to read
# Shorten them by removing text using gsub() = global substitution
immun_data$gho_s <- gsub("immunization coverage among 1-year-olds",
                            "", immun_data$gho)
immun_data$gho_s <- gsub("immunization coverage by the nationally recommended age",
                        "", immun_data$gho_s)

# make the plot again
plot <- ggplot(immun_data, aes(x = year, y = value, 
    colour = gho_s)) +
  geom_line(size = 1) + 
  theme(legend.position="none")

# separate plots with facet_wrap
plotf <- plot + facet_wrap(~gho_s)
plotf <- plotf + theme_bw() + theme(legend.position="none")
plotf

# improve plot with y limits, titles & source
source <- paste("Source: World Health Organisation, accessed:", Sys.Date())
plotf <- plotf + ylim(0,100)
plotf <- plotf + labs(x = NULL, y = "Immunization Rate",
      title = "Global immunization rates", 
    subtitle = source)
plotf



## END

Some Resources:



Friday 10 August 2018

Exploring some immunization data... one graph from Factfulness..

I've been reading Factfulness - an important book about statistics, data visualization and generally understanding the world. Written by Hans Rosling with Ola Rosling and Anna Rosling Rönnlund of Gapminder, it uses data to try to improve our world view as many of us have an incorrect world view. It is based on the very popular videos by Hans Rosling. Here is one to have a look at...


I like to explore data and there is lots of data in the book. One of the things I like to do is to try to reproduce the graph from publications. So here is a graph from Factfulness:

It shows how the world has improved in terms of immunization. Most of us don't know that almost 90% of children worldwide are immunized.  Here is a graph for just the TB vaccine - data for BCG immunization uptake for children. It's similar to the graph above and it's made in R.



The code uses the WHO R package which allows us to access immunization data from the Global Health Observatory data repository.

Here is the code that makes the graph and some of the steps along the way.

START 

library(tidyverse)
# install.packages("WHO")
library(WHO)
# check out the codes of the WHO data...
# requires internet access
codes <- get_codes()
colnames(codes)



glimpse(codes)  # useful function from tibble package

codes[1:10, 2]

# get codes for immunizations
# search using grepl() function for finding regular expressions
codes[grepl("[Ii]mmuniz", codes$display), ]

# ----downloadBCG data
bcg_data <- get_data("WHS4_543")  # BCG data...
# requires internet access and takes a few minutes...

## ----lookatBCG data
glimpse(bcg_data)

summary(bcg_data)

## ----globalGraph
# extract Global data with the filter() function from dplyr package
# then plot...
bcg_data %>% 
    filter(region == "(WHO) Global") %>% 
    ggplot(aes(x = year, y = value)) +
    geom_line(size = 1) +
    labs(x = NULL, y = "BCG Immunization Rates")





## To make the graph look more like Factfulness...
# 1. Add a title
# 2. Set limits to 0 and 100%
# 3. Add source
# 4. Add an point at the start
# 5. Add an arrow at the end
# 6. Add values and years to start and end...
# 7. To change the style to grey under the arrow

## add titles and limits to y axis...
bcg_data %>% 
    filter(region == "(WHO) Global") %>% 
    ggplot(aes(x = year, y = value)) +
    geom_line(size = 1) +
    labs(x = NULL, y = "BCG Immunization Rates", 
        title = "BCG Immunization Rates") + 
    ylim(0,100) -> bcg_plot
bcg_plot





# add a source
source <- paste("Source: World Health Organisation \n accessed:",     Sys.Date())
bcg_plot + annotate("text",  x = 2008, y = 10, label=source, size=3)




## Add points and annotation requires more work....
# pull out global data
bcg_data %>% 
    filter(region == "(WHO) Global") -> glob_bcg
# identify minimum and maximum point and labels
min_point <- filter(glob_bcg, year == min(glob_bcg$year))
min_point_label <- paste0(min_point$value, "%\n", min_point$year)
max_point <- filter(glob_bcg, year == max(glob_bcg$year))
max_point_label <- paste0(max_point$value, "%\n", max_point$year)

# Use WHO title - better plan
graph_title <- glob_bcg$gho[1]

# put it together to make a graph
bcg_plot <- ggplot(glob_bcg, aes(x = year, y = value)) +
    geom_line(size = 1) +
    labs(x = NULL, y = "Immunization Rate",
        title = graph_title) + 
    ylim(0,100) +
    geom_point(data = min_point, aes(year, value)) +
    geom_text(data = min_point, label=min_point_label, hjust=-0.8) +
    geom_point(data = max_point, aes(year, value), 
        shape = 62, size = 5) +
    geom_text(data = max_point, label=max_point_label, vjust=1.2)

bcg_plot




## add grey underneath the line
# this uses geom_ribbon()
bcg_plot + geom_ribbon(aes(ymin=0, ymax=value), 
    fill = "#DCDCDC") # this colour is a nice grey. 



# works but overlays earlier annotation. 

## ----change order to make it all work together for final plot
bcg_plot <- ggplot(glob_bcg, aes(x = year, y = value)) +
    geom_ribbon(aes(ymin=0, ymax=value), fill = "#DCDCDC") +
    geom_line(size = 1) +
    theme_bw() +
    labs(x = NULL, y = "Immunization Rate",
        title = graph_title) + 
    ylim(0,100) +
    geom_point(data = min_point, aes(year, value), size = 5) +
    geom_text(data = min_point, label=min_point_label, hjust=-0.8) +
    geom_point(data = max_point, aes(year, value), 
        shape = 62, size = 8) +
    geom_text(data = max_point, label=max_point_label, vjust=1.2) +
    annotate("text",  x = 2008, y = 10, label=source, size=3)

bcg_plot



END


Some Resources:

Friday 22 June 2018

Creating a drawProteins hex sticker...

Lots of Bioconductor packages have created hex stickers - those fun stickers that go on the top of your laptop. They are a bit nerdy but, hey, who doesn't like nerdy 😀

I thought I should create a hex sticker for drawProteins since it's now available in Bioconductor.

It's been created totally in R using the hexSticker package from the talented GuangchuangYu.

Here is the sticker:


Below is the R code that makes the sticker. I've chosen to show three of NF-kappaB proteins but the code could be adapted to illustrate your favourite protein. You can also customise the sticker colour and lots of other things.

START
# script to create drawProteins sticker
# setwd("/Users/paulbrennan/Documents/drawProteins_hexsticker")
library(ggplot2)
# install.packages("hexSticker")
library(hexSticker)
library(drawProteins)

data("five_rel_data")
p <- draw_canvas(five_rel_data)
p <- draw_chains(p, five_rel_data, label_chains = FALSE, size =0.2)
p <- draw_domains(p, five_rel_data,
  show.legend = FALSE,
  label_size = 2)
p

p <- draw_phospho(p, five_rel_data)
p <- p  + theme_void()

sticker(p, package="drawProteins", p_size=6, s_x=.98,
  s_y=.88, s_width=1.4, s_height=1.1,h_fill = "#114466", h_color="#001030")


# this doesn't quite work :-(
# size of text is wrong... no labels... cramped!
# I think five proteins is probably too many...

# reduce to three....
# P19838
# Q04206
# Q04864

# get from UniProt rather than using internal data
prot_data <- drawProteins::get_features("Q04206 Q04864 P19838")

# turn the data into a dataframe
prot_data <- drawProteins::feature_to_dataframe(prot_data)

# create canvas
p <- draw_canvas(prot_data)

# customise the labels
p <- draw_chains(p, prot_data, label_size = 1.4,
  labels = c("c-Rel",
    "p50/p105",
    "p50/p105",
    "p65/Rel A"),
  size=0.2)

# draw domains without a legend or labels
p <- draw_domains(p, prot_data, show.legend = FALSE, label_domains = FALSE)

# add text and control size with geom_text rather than labels from drawProteins
p <- p + ggplot2::geom_text(data = prot_data[prot_data$type == "DOMAIN", ],
  ggplot2::aes(x = begin + (end-begin)/2,
               y = order,
               label = description),
           size = 1.3)

# keep phosphorylation sites small
p <- draw_phospho(p, prot_data, size = 1)

# a completely blank theme using theme_void()
p <- p  + theme_void()
p

# make the sticker using the ggplot object entitled p
# creates a png file
sticker(p,
  package="drawProteins",
  p_size=6,  # font size for package name
  s_x=.95,   # x position for ggplot object
  s_y=.86,   # y position for ggplot object
  s_width=1.4,
  s_height=1.1,
  h_fill = "#e8a935",   # background colour for hexagon
  h_color="#001030",    # border colour
  url = "www.bioconductor.org")
# automatically creates the PNG file with the name drawProteins.


# make the PDF
sticker(p,
  package="drawProteins",
  p_size=6,  # font size for package name
  s_x=.95,   # x position for ggplot object
  s_y=.86,   # y position for ggplot object
  s_width=1.4,
  s_height=1.1,
  h_fill = "#e8a935",   # background colour for hexagon
  h_color="#001030",    # border colour
  url = "www.bioconductor.org",
  filename="drawProteins.pdf")

END


Some Resources:

For more help, bug reports or to suggest features


Tuesday 13 February 2018

Plotting disease causing myosin variants using drawProteins

My drawProteins package has been accepted by Bioconductor. It's currently available as a development version -  a nice conclusion of lots of effort.

Last week, I had my first query about drawProteins from Kevin T. Booth, a graduate student at the University of Iowa. Kevin wanted to plot disease causing protein variants.

Kevin's working on the MYO7A gene (Uniprot ID “Q13402”). Kevin told me that "alterations in MYO7A can result in 3 different phenotypes-Autosomal Dominant non-syndromic hearing loss (DFNA11),  Autosomal Recessive non-syndromic hearing loss (DFNB2) or Usher Syndrome Type 1 (USH1B), another type of inherited deafness.

Looking at the Uniprot page for MYO7A gene, it contains a long list of disease associated variants. We created a visualisation of these variants using the geom_point() function to apply a ggplot2 layer to a schematic of the protein.




Here is the script that generated this and a few other images on the way. Feedback welcome.

SCRIPT START
# exploring plotting variants
# using data from Q13402

# download the protein data first...

library(drawProteins)
library(ggplot2)
# pulling down data for Kevin's protein - MYO7A
prot_data <- drawProteins::get_features("Q13402")
# produce data frame
prot_data <- drawProteins::feature_to_dataframe(prot_data)

# make protein schematic
p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data)
p <- draw_domains(p, prot_data)
p





# nice domain structure
# note that it is a long protein > 2000 amino acids. 

p <- p + theme_bw(base_size = 14) + # white background
    theme(panel.grid.minor=element_blank(), 
        panel.grid.major=element_blank()) +
    theme(axis.ticks = element_blank(), 
        axis.text.y = element_blank()) +
    theme(panel.border = element_blank())
p




# EXPLORE VARIANTS
# many VARIANTS
# count them...
nrow(prot_data[prot_data$type == "VARIANT",])
# ans = 74

# these can be plotted using geom_point() function
p + geom_point(data = prot_data[prot_data$type == "VARIANT",],
    aes(x = begin,
        y = order+0.25))


# wanted focus on disease variants and colour the by disease...
# so need to change the info...
# colour or shape by info within it...

# duplicate prot_data
prot_data2 <- prot_data
# duplicate description..
prot_data2$disc2 <- prot_data$description
# remove word "in"
prot_data2$disc2 <- gsub("in ", "", prot_data2$disc2)

# pull out the disease name...
library(stringr)
temptable <- str_split_fixed(prot_data2$disc2, ";|:", 2)
prot_data2 <- cbind(prot_data2, temptable)
# name columns
tempnames <- colnames(prot_data2)
tempnames[11] <- "disease"
colnames(prot_data2) <- tempnames

# add colour argument within aesthetics
p + geom_point(data = prot_data2[prot_data2$type == "VARIANT",],
    aes(x = begin,
        y = order+0.25,
        colour = disease))



# exclude some of the VARIANTS as not disease causing..
# dbSNPs, rare polymorphism and the blank one 
prot_data2$disease[prot_data2$disease=="dbSNP"] <- NA
prot_data2$disease[prot_data2$disease==""] <- NA
prot_data2$disease[prot_data2$disease=="rare polymorphism"] <- NA

# clean up variant list for easier legend
disease_var_data <- prot_data2[prot_data2$type == "VARIANT",]
disease_var_data <- na.omit(disease_var_data)
disease_var_data$disease <- gsub("found patients with ", 
    "", disease_var_data$disease)
disease_var_data$disease <- gsub("found a patient with ", 
    "", disease_var_data$disease)

# add colour argument within aesthetics
p + geom_point(data = disease_var_data,
    aes(x = begin,
        y = order+0.25,
        colour = disease))

# over plotting is an problem 
# enlarge canvas and explore geom_jitter()
p <- p + ylim(0.75, 3)
p <- p + geom_jitter(data = disease_var_data,
            aes(x = begin,
                y = 2,
                colour = disease),
            width = 0, height = 0.75, 
            size = 2)


# add titles
myo7A_subtitle <- paste0("circles = Disease Variants\n",
    "source: Uniprot accessed 8 Feb 2018")

p <- p + labs(title = "Schematic of human Unconventional myosin-VIIa",
    subtitle = myo7A_subtitle)
p


SCRIPT END

Some Resources

For more help, bug reports or to suggest features