Wednesday, 12 October 2016

Exploring the UK Gender Pay Gap with R...


The gender pay gap in the UK may not be of primary interest to some biochemists, but the Biochemical Society is interested in gender equality, and the majority of biochemistry undergraduates are female... Here is the Biochemical Society's policy statement and here is something relevant from their blog.

@UKParliament tweeted today about a report on the gender pay gap (it seems it was published last year). An Excel file accompanies the report.


Today, I've been using R to explore some of the data and I have written a script below to make these graphs - the first two graphs from the report.






The way the data is presented makes me uncomfortable: men being paid more is shown as a positive percentage, while women being paid more is shown as a negative percentage. I feel sure there is a better way....
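One possible alternative, offered only as a sketch: express women's pay as a percentage of men's pay, so that 100 means equal pay and no group gets a negative number. This assumes the gap is defined as (men's pay - women's pay) / men's pay x 100, which I have not checked against the report's methodology. The function name `gap_to_ratio` and the example values are made up for illustration and do not come from the report.

```r
# Hypothetical helper, not part of the script further down.
# Assumption: gap (%) = (men - women) / men * 100,
# so women's pay as a percentage of men's pay is 100 - gap.
gap_to_ratio <- function(gap_percent) {
  100 - gap_percent
}

# Made-up illustrative gaps, NOT values from the report
example_gaps <- c(20, 0, -5)
gap_to_ratio(example_gaps)
```

Under that assumption, a gap of 20% becomes 80 (women earn 80 for every 100 men earn) and a gap of -5% becomes 105, so both directions read as positive numbers around a baseline of 100.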

Still, the data is interesting....

Here is the script:
START
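library(readxl)    # read_excel() for reading the Excel file
library(ggplot2)   # plotting
library(reshape2)  # melt() for reshaping wide to long
library(ggthemes)  # extra themes such as theme_hc()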

# this is the link to the data
link <- "http://researchbriefings.files.parliament.uk/documents/SN07068/data-tables.xlsx"

# the download.file() function downloads and saves the file with the name given
download.file(url=link,destfile="file.xlsx", mode="wb")

# then we can open the file and extract the data using the read_excel() function. 
data <- read_excel("file.xlsx", skip=3, col_names=TRUE)

str(data)
# str() shows that Year has been read in as character
data[[1]] <- as.numeric(data[[1]])   # change to number; [[1]] extracts the column as a vector
data <- data[1:22,]    # get rid of seven rows of NAs.
names <- colnames(data)
names[1] <- "Year"
names[2] <- "All_employees"
colnames(data) <- names  # make column names easier to use
data[,2:4] <- data[,2:4]*100   # Excel stores percents as decimals

# reshape the data from wide to long format
data.melt <- melt(data, id.vars = "Year")
colnames(data.melt) <- c("Year", "empType", "gendGap")


# draw the graph
p1 <- ggplot(data.melt, aes(x=Year, 
                           y= gendGap, 
                           colour = empType)) + 
  geom_point() +   # draw the points
  geom_line(size=1) +  # draw the lines
  labs(color = "Employment Type") + # customizes the legend title
  ylab("Gender Gap (%)") + # y-label
  ggtitle("Gender Pay Gap, UK, 1997-2015") +   # graph title
  ylim(-10,30) + 
  xlim(1995, 2015) +
  geom_hline(yintercept = 0) +  # nice line at zero
  theme_bw()

p1 <- p1 + theme(legend.text=element_text(size = 12), # increase size of text
               legend.title=element_text(size = 12)) # and title

p1 <- p1 + theme(axis.title.y = element_text(size = 14 )) + 
  theme(axis.text = element_text(size = 12))

p1 # show the graph





# maybe you prefer a different theme.
p1 + theme_hc()





# maybe without a legend but with labels on the lines:
p1 <- p1 + theme(legend.position="none") + 
  geom_text(data = data.melt[which(data.melt$Year == 2013),],  # Year is numeric, so compare with a number
               aes(label = empType),
               vjust = -2)
p1






# draw the second graph with the age data... 
data2 <- read_excel("file.xlsx", sheet=2, skip=3, col_names=TRUE)
View(data2)   # look at the sheet (interactive sessions only)
str(data2)
data2 <- data2[1:8,]   # keep the first eight rows

# multiply numbers by 100 to give percentages
data2[,2:4] <- data2[,2:4]*100

names <- colnames(data2)
names[1] <- "Ages"
names[2] <- "All_employees"
colnames(data2) <- names
data2.type <- data2[3:8,]

data2.melt <- melt(data2.type, id.vars = "Ages")
colnames(data2.melt) <- c("Ages", "empType", "gendGap")

# rows 7:18 leave out the first six melted rows, which hold All_employees
g <- ggplot(data = data2.melt[7:18,], aes(x = Ages, y = gendGap, fill = empType))
g <- g + geom_bar(stat="identity", position="dodge", width = 0.75) +
  ylim(-11,20)  +
  ylab("Gender Gap (%)") + # y-label
  xlab("Age") + # x-label
  ggtitle("Gender Pay Gap by Age, April 2015") +
  labs(fill = "Employment Type") + # customizes the legend title
  theme_hc() +
  theme(legend.position=c(0,1), # move to the top left
        legend.justification=c(0,1.5)) # move it in a bit
g  # show the graph...