Thursday, 30 April 2020

Exploring Phantom ticket prices with Tidy Tuesday...

Not biological data but an interesting Tidy Tuesday data set.

It is about Broadway shows. Here is a plot suggesting that The Phantom of the Opera has become relatively cheaper in a more diverse market.



START
# a script for Tidy Tuesday and Tidy Thursday at CaRdiff UseR group 
# 30 April 2020

library(tidyverse)
# pull in the data
grosses <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-28/grosses.csv', guess_max = 40000)

# How many unique shows...
length(unique(grosses$show))
# 1122

# inspiration from here: https://github.com/teunbrand/tidytuesdayscripts/blob/master/scripts/2020_04_28_Broadway_Grosses.R
# and image from here: https://twitter.com/TeunvandenBrand/status/1255253561535074306/photo/1

# first make a basic graph: average ticket prices over time...

ggplot(grosses, aes(week_ending, avg_ticket_price)) +
    geom_point(alpha = 0.25)
# a ~$500 show stretches the y-axis a bit...

grosses %>%
    filter(avg_ticket_price > 400) -> expensive
# this show is Springsteen on Broadway at $500!!!
# can exclude by limiting the y-axis
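# note: two ways to limit an axis (a sketch on toy data, not the grosses
# data frame): ylim() drops out-of-range points entirely, while
# coord_cartesian() only zooms in and keeps all the data
toy <- data.frame(x = 1:5, y = c(55, 80, 120, 500, 95))
p_drop <- ggplot(toy, aes(x, y)) + geom_point() + ylim(0, 300)
p_zoom <- ggplot(toy, aes(x, y)) + geom_point() + coord_cartesian(ylim = c(0, 300))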

# filter for the Phantom
grosses %>%
    filter(show == "The Phantom of the Opera") %>%
    ggplot(aes(week_ending, avg_ticket_price, colour=pct_capacity)) +
    geom_point() +
    geom_smooth() + 
    labs(x = "",
        y = "Average Ticket Price ($)", 
        title = "Phantom of the Opera",
        subtitle = "Data from Playbill via Tidy Tuesday") 

# why does the show seem more expensive in some weeks than in others?
# compare with another long-running show, Mamma Mia!

grosses %>%
    filter(show == "Mamma Mia!") %>%
    ggplot(aes(week_ending, avg_ticket_price, colour=pct_capacity)) +
    geom_point() +
    geom_smooth() + 
    labs(x = "",
        y = "Average Ticket Price ($)", 
        title = "Mamma Mia!",
        subtitle = "Data from Playbill via Tidy Tuesday") 



# how does Phantom of the Opera compare to the others?
# make one plot first
plot <- ggplot(grosses, aes(week_ending, avg_ticket_price)) +
    geom_point(alpha = 0.25, size = 0.5) + ylim(0,300)

# filter Phantom
grosses %>%
    filter(show == "The Phantom of the Opera") -> p_of_op

# add to the first plot...
plot2 <- plot +
    geom_point(data = p_of_op, aes(week_ending, avg_ticket_price),
        colour = "red") 

# show with titles
plot2 + theme_bw() +
    labs(x = "",
        y = "Average Ticket Price ($)", 
        title = "Phantom of the Opera (red): relatively cheaper in a more diverse market",
        subtitle = "More expensive some weeks!\nData from Playbill via Tidy Tuesday")


END



Thursday, 23 April 2020

Analysing an ELISA standard curve...

As part of R for Biochemists 101, the training programme created for the Biochemical Society, one of the participants, Abigail Byford (a British Heart Foundation 4-year PhD student from Leeds Institute of Cardiovascular and Metabolic Medicine) shared some of her data.  She performed an ELISA for human chorionic gonadotropin (hCG) - a pregnancy hormone.

An ELISA generates a colour that is read using a 96-well plate spectrophotometer, which measures absorbance at a specific wavelength.

This script illustrates working with 96-well plate data, fitting a standard curve, and calculating unknowns from it.



--- START ---
library(readxl)
library(ggplot2)
library(dplyr)

# import data from an Excel file on GitHub
# this comes straight from the spectrophotometer
link <- "https://github.com/brennanpincardiff/RforBiochemists/raw/master/data/hCG_absorbances.xlsx"

download.file(url=link, destfile="hCG_absorbances.xlsx", mode="wb")

hcg_file <- read_xlsx("hCG_absorbances.xlsx")

wavelength <- hcg_file[17,2]
days <- hcg_file[5,2]
# Excel stores the number of days since Jan-0-1900
# https://stackoverflow.com/questions/25158969/read-xlsx-reading-dates-wrong-if-non-date-in-column
# adding days to a date:
# https://stackoverflow.com/questions/10322035/r-adding-days-to-a-date
date <- as.Date("1900-01-01") + as.numeric(days) - 2
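# quick sanity check on the Excel serial-date conversion: Microsoft's
# documentation gives 1 January 2008 as serial number 39448, so the
# formula above (origin + serial - 2, correcting for the off-by-one
# origin and Excel's phantom 29 February 1900) should recover that date
as.Date("1900-01-01") + 39448 - 2
# 2008-01-01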

# this allows us a first look at the file...
# there is lots of metadata at the top which we don't need right now..

# we can skip at least the first 23 rows

abs_data <- read_xlsx("hCG_absorbances.xlsx", skip = 23)

# standard curve values are:
hcg <- c(5, 50, 200, 500, 1000, 5, 50, 200, 500, 1000)

# these correspond to rows A to E in the first two columns of readings
# (columns 2 and 3 of abs_data)
# we can pull these out with subsetting [rows, columns]
abs_data[1:5,2]  # gives first set of standards

# using c() and unlist() will turn these into a vector
abs_std <- c(unlist(abs_data[1:5,2]), unlist(abs_data[1:5,3]))
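# why unlist()? subsetting a tibble with [rows, column] returns a
# one-column tibble, not a vector, so c() alone would give a list;
# a small base-R sketch (toy numbers, for illustration only):
toy <- list(col2 = c(0.1, 0.2), col3 = c(0.3, 0.4))
unlist(toy)   # one numeric vector of length 4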

# plot the data
plot(abs_std~hcg)



# create the dataframe
stand_curve <- data.frame(hcg, abs_std)

# draw the graph...
ggplot(stand_curve, aes(x = hcg, y = abs_std))+
    geom_point() +
    geom_smooth()
# line begins to top out at high absorbance - assay limitation



# plot and make the linear plot with just the first three values
# where hcg is less than 250
# here is the data...
stand_curve %>%
    filter(hcg<250) %>%
    ggplot(aes(x = hcg, y = abs_std)) +
    geom_point() +
    stat_smooth(method = "lm", formula = y~x) +  
    xlab("hCG") +  
    ylab(paste("Absorbance", as.character(wavelength))) +    
    ggtitle(paste("hCG Standard Curve \n", date)) 



stand_curve <- filter(stand_curve, hcg<250)
line <- lm(abs_std ~ hcg, data = stand_curve)

# extract intercept and slope from our line object
int <- line$coefficients[1]
slope <- line$coefficients[2]
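# coef() is the usual accessor for the same numbers, and a round trip on
# made-up standards (toy values, for illustration only) confirms that
# (absorbance - intercept) / slope recovers the concentration:
conc_toy <- c(5, 50, 200)
abs_toy <- 0.002 * conc_toy + 0.05      # perfectly linear toy data
fit_toy <- lm(abs_toy ~ conc_toy)
(abs_toy - coef(fit_toy)[1]) / coef(fit_toy)[2]
# 5 50 200 (up to floating-point error)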

# now calculate the unknowns 
# R will calculate the whole data frame for us
# and put it into another data frame. 
# abs_data[,2:11] - subsets the numbers we want. 
# round() reduces the number of decimal points. 
hCG_Ukns <- round((abs_data[,2:11] - int)/slope, 1)

hCG_Ukns
# some of the unknowns are below the lowest standard but hey...
--- END ---

