Introduction

This document provides worked answers for all of the exercises in the Introduction to R with Tidyverse course.

Exercise 1

Numbers

We start by doing some simple calculations.

31 * 78
## [1] 2418
697 / 41
## [1] 17

We next look at how to assign data to named variables and then use those variables in calculations.

We make assignments using arrows; they can point to the right or the left, depending on the order of the data and the variable name.

39 -> x

y <- 22

We can then use these in calculations instead of re-entering the data.

x - y
## [1] 17

We can also save the results directly into a new variable.

x - y -> z

z
## [1] 17

Text

We can also store text. The difference with text is that we need to indicate to R that this isn’t something it should try to understand. We do this by putting it into quotes.

"simon" -> my.name

We can use the nchar function to find out how many characters my name has in it.

nchar(my.name)
## [1] 5

We can also use the substr function to get the first letter of my name.

substr(my.name,start=1,stop=1)
## [1] "s"

Exercise 2

Making vectors manually

We’re going to manually make some vectors to work with. For the first one there is no pattern to the numbers so we’re going to make it completely manually with the c() function.

c(2,5,8,12,16) -> some.numbers

For the second one we’re making an integer series, so we can use the colon notation to enter this more quickly.

5:9 -> number.range

Now we can do some maths using the two vectors.

some.numbers - number.range
## [1] -3 -1  1  4  7

Because the two vectors are the same length, the values at equivalent positions are matched together. Thus the final answer is:

(2-5), (5-6), (8-7), (12-8), (16-9)

Vector functions and subsetting

We’re going to use some functions which return vectors and then use the subsetting functionality on them.

First we’re going to make a numerical sequence with the seq function.

seq(
  from=2,
  by=3,
  length.out = 100
) -> number.series

number.series
##   [1]   2   5   8  11  14  17  20  23  26  29  32  35  38  41  44  47  50  53
##  [19]  56  59  62  65  68  71  74  77  80  83  86  89  92  95  98 101 104 107
##  [37] 110 113 116 119 122 125 128 131 134 137 140 143 146 149 152 155 158 161
##  [55] 164 167 170 173 176 179 182 185 188 191 194 197 200 203 206 209 212 215
##  [73] 218 221 224 227 230 233 236 239 242 245 248 251 254 257 260 263 266 269
##  [91] 272 275 278 281 284 287 290 293 296 299

We now want to extract the values at positions 5,10,15 and 20. This means that we need a vector with these values in it. It’s short enough that we can just enter these manually, but we can also see that it’s a mathematical progression, so we could also use seq to create this.

c(5,10,15,20)
## [1]  5 10 15 20
seq(from=5, by=5, to=20)
## [1]  5 10 15 20

We can now use either of these methods to select the corresponding values at those positions in the number.series data structure.

number.series[c(5,10,15,20)]
## [1] 14 29 44 59
number.series[seq(from=5,by=5, to=20)]
## [1] 14 29 44 59

Finally we’re going to extract all values from positions 10 to 30. For this we’ll use the colon operator as we did in the last exercise, but now it’s inside a selector.

number.series[10:30]
##  [1] 29 32 35 38 41 44 47 50 53 56 59 62 65 68 71 74 77 80 83 86 89

Statistical functions and vectors

Since R is a language built around data manipulation and statistics we can use some of the built in statistical functions.

We can use rnorm to generate a sampled set of values from a normal distribution.

rnorm(20) -> normal.numbers

Note that if you run this multiple times you’ll get slightly different results.

We can now use the t.test function to test whether this vector of numbers has a mean which is significantly different from zero.

t.test(normal.numbers)
## 
##  One Sample t-test
## 
## data:  normal.numbers
## t = 1.1001, df = 19, p-value = 0.285
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.1674367  0.5384588
## sample estimates:
## mean of x 
## 0.1855111

Not surprisingly, it isn’t significantly different.

If we do the same thing again but this time use a distribution with a mean of 1 we should see a difference in the statistical results we get.

t.test(rnorm(20, mean=1))
## 
##  One Sample t-test
## 
## data:  rnorm(20, mean = 1)
## t = 6.363, df = 19, p-value = 4.191e-06
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.6841746 1.3549094
## sample estimates:
## mean of x 
##  1.019542

This time the result is significant.

(Optional) Median Centering

To median center the values in number.series we simply calculate the median value and then subtract this from each of the values in the vector.

number.series - median(number.series)
##   [1] -148.5 -145.5 -142.5 -139.5 -136.5 -133.5 -130.5 -127.5 -124.5 -121.5
##  [11] -118.5 -115.5 -112.5 -109.5 -106.5 -103.5 -100.5  -97.5  -94.5  -91.5
##  [21]  -88.5  -85.5  -82.5  -79.5  -76.5  -73.5  -70.5  -67.5  -64.5  -61.5
##  [31]  -58.5  -55.5  -52.5  -49.5  -46.5  -43.5  -40.5  -37.5  -34.5  -31.5
##  [41]  -28.5  -25.5  -22.5  -19.5  -16.5  -13.5  -10.5   -7.5   -4.5   -1.5
##  [51]    1.5    4.5    7.5   10.5   13.5   16.5   19.5   22.5   25.5   28.5
##  [61]   31.5   34.5   37.5   40.5   43.5   46.5   49.5   52.5   55.5   58.5
##  [71]   61.5   64.5   67.5   70.5   73.5   76.5   79.5   82.5   85.5   88.5
##  [81]   91.5   94.5   97.5  100.5  103.5  106.5  109.5  112.5  115.5  118.5
##  [91]  121.5  124.5  127.5  130.5  133.5  136.5  139.5  142.5  145.5  148.5

(Optional) T-test power

If we are sampling from two distributions with only a 1% difference in their means, how many observations do we need before we can detect a significant difference between them?

Let’s try a few different sample sizes to see.

100 Samples

samples <- 100

rnorm(samples,mean=10,sd=2) -> data1
rnorm(samples,mean=10.1,sd=2) -> data2

t.test(data1,data2)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = 0.281, df = 197.65, p-value = 0.779
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5131564  0.6836997
## sample estimates:
## mean of x mean of y 
##  10.15606  10.07079

500 Samples

samples <- 500

rnorm(samples,mean=10,sd=2) -> data1
rnorm(samples,mean=10.1,sd=2) -> data2

t.test(data1,data2)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = -0.72221, df = 993.57, p-value = 0.4703
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3476367  0.1605922
## sample estimates:
## mean of x mean of y 
##  10.04706  10.14058

1000 Samples

samples <- 1000

rnorm(samples,mean=10,sd=2) -> data1
rnorm(samples,mean=10.1,sd=2) -> data2

t.test(data1,data2)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = -2.5475, df = 1997.2, p-value = 0.01092
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.39747003 -0.05169393
## sample estimates:
## mean of x mean of y 
##  9.971158 10.195740

5000 Samples

samples <- 5000

rnorm(samples,mean=10,sd=2) -> data1
rnorm(samples,mean=10.1,sd=2) -> data2

t.test(data1,data2)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = -4.3498, df = 9997.9, p-value = 1.376e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.25087714 -0.09500668
## sample estimates:
## mean of x mean of y 
##  9.962487 10.135429

It’s only really when we get up close to 5000 samples that we can reliably detect such a small difference. The answers will be different every time since rnorm involves a random component.
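Rather than finding this by trial and error, base R’s power.t.test function can estimate the required sample size directly. A sketch for detecting a 0.1 difference in means with a standard deviation of 2, at the conventional 5% significance level and 80% power:

```r
# Leave n unspecified and power.t.test solves for the number of
# observations needed per group to reach the requested power.
power.t.test(delta = 0.1, sd = 2, sig.level = 0.05, power = 0.8)
```

This comes out at roughly 6,000 observations per group, which fits with 5000 samples being borderline in the runs above.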

Exercise 3

We’re going to read some data from a file straight into R. To do this we’re going to use the tidyverse read_ functions. We therefore need to load tidyverse into our script.

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Reading a small file

We’ll start off by reading in a small file.

read_tsv("small_file.txt") -> small
## Parsed with column specification:
## cols(
##   Sample = col_character(),
##   Length = col_double(),
##   Category = col_character()
## )
small

Note that the only relevant name from now on is small which is the name we saved the data under. The original file name is irrelevant after the data is loaded.

We can see that Sample and Category have the ‘character’ type because they are text. Length has the ‘double’ type because it is a number.

We want to find the median of the log2 transformed lengths.

To start with we need to extract the length column using the $ notation.

small$Length
##  [1]  45  82  81  56  96  85  65  96  60  62  80  63  50  64  43  98  78  53 100
## [20]  79  84  68  99  65  55  98  56  83  81  69  50  72  54  56  87  84  80  68
## [39]  95  93

Now we can log2 transform this.

log2(small$Length)
##  [1] 5.491853 6.357552 6.339850 5.807355 6.584963 6.409391 6.022368 6.584963
##  [9] 5.906891 5.954196 6.321928 5.977280 5.643856 6.000000 5.426265 6.614710
## [17] 6.285402 5.727920 6.643856 6.303781 6.392317 6.087463 6.629357 6.022368
## [25] 5.781360 6.614710 5.807355 6.375039 6.339850 6.108524 5.643856 6.169925
## [33] 5.754888 5.807355 6.442943 6.392317 6.321928 6.087463 6.569856 6.539159

Finally we can find the median of this.

median(log2(small$Length))
## [1] 6.227664

Reading a larger variants file

The second file we’re going to read is a CSV file of variant data. We therefore need to use read_csv to read it in.

read_csv("Child_Variants.csv") -> child
## Parsed with column specification:
## cols(
##   CHR = col_character(),
##   POS = col_double(),
##   dbSNP = col_character(),
##   REF = col_character(),
##   ALT = col_character(),
##   QUAL = col_double(),
##   GENE = col_character(),
##   ENST = col_character(),
##   MutantReads = col_double(),
##   COVERAGE = col_double(),
##   MutantReadPercent = col_double()
## )
child

We can see that each row is a set of different parameters for a genetic variant.

We want to calculate the mean and standard deviation (sd) of the MutantReadPercent column.

We can again use the $ notation to extract the relevant column into a vector. We can use head to just look at the first few values.

head(child$MutantReadPercent, n=100)
##   [1]  90  42  28  46 100 100  75  88 100 100  90  92  97  95  80  89  89  81
##  [19]  33  90 100  90 100 100 100  89  82 100  90 100 100  92  93  90  85  93
##  [37]  94  57  31  90  34  51  39  44  42  56  90  26  28  93  93  95  89  77
##  [55] 100  54  37  50  38  30  42  76  47  42  27  50  45  22  35  36  38  45
##  [73]  34  47  54  59  28  91  22  66  57 100  35  34  46 100  26  60 100  44
##  [91]  33  82  25  21  83 100  85  36  84  75

We can then pass this to the relevant functions to calculate the mean and sd.

mean(child$MutantReadPercent)
## [1] 59.88219
sd(child$MutantReadPercent)
## [1] 24.11675

Exercise 4

In this section we are going to use two tidyverse functions to cut down our data. We will use select to pick only certain columns of information to keep, and filter to apply functional selections to the rows.

Filtering small data

Extract all of the data for Category A

filter(small, Category == "A")

Just samples with a length of more than 80

filter(small, Length > 80)

Remove the Sample column. We do a negative column selection by using a minus before the name.

select(small, -Sample)

We could have instead selected the columns we wanted to keep.

select(small, Length, Category)

Filtering child variants

Select data from the mitochondrion (called MT in this data)

filter(child, CHR=="MT")

Select variants with a MutantReadPercent of more than (or equal to) 70.

filter(child, MutantReadPercent >=70)

Select variants with a quality of 200

filter(child, QUAL==200)

Select all variants of the IGFN1 gene

filter(child, GENE=="IGFN1")

Remove the ENST and dbSNP columns.

select(child,-ENST, -dbSNP)

Exercise 5

Here we are going to do some more complicated multi-step filtering. For this we will be using the %>% pipe operator to link together different steps in the selection and filtering.

Quality filtering

We want variants with a quality of 200, coverage of more than 50 and MutantReadPercent of more than 70.

child %>%
  filter(QUAL == 200) %>%
  filter(COVERAGE > 50) %>%
  filter(MutantReadPercent > 70)

Positional Filtering

We want to remove all variants on the X, Y and MT chromosomes. At the moment we will have to do this in three separate filtering steps, but there is actually a quicker way to do it in one step if we get a bit more advanced.

child %>%
  filter(CHR != "X") %>%
  filter(CHR != "Y") %>%
  filter(CHR != "MT")
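The quicker one-step approach alluded to above uses the %in% operator, which tests whether each value is a member of a vector; negating it with ! keeps only the other chromosomes. A sketch, assuming the child tibble from Exercise 3 is still loaded:

```r
# %in% returns TRUE wherever CHR matches any element of the vector,
# and ! flips this, so X, Y and MT are all removed in one filter.
child %>%
  filter(!(CHR %in% c("X", "Y", "MT")))
```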

Annotation Filtering

We want to see the chromosome and position only for variants with a valid dbSNP id. We can get the valid dbSNP variants by excluding the variants with a dot as their ID.

child %>%
  filter(dbSNP != ".") %>%
  select(CHR,POS)

Transform - find deletions

Here we’re going to do a more complicated filtering where we will transform the data within the filter statement. In this example we’re going to find deletions. In a deletion the REF column will have more than one character in it, so that’s what we’re going to select for.

We can use the nchar function as before, but this time we’re going to use it in a filter condition.

child %>%
  filter(nchar(REF) > 1)

Transform - Q genes

As a second example we’re going to use another transformation along with a conventional selection to find genes whose name starts with a Q, which have variants where C is the ALT base.

We will use the substr function to find the genes which start with Q and a conventional filter to get the ALT base equal to C.

child %>%
  filter(substr(GENE,1,1) == "Q") %>%
  filter(ALT == "C")

Exercise 6

Here we’re going to draw our first ggplot plot. We’re going to use a new dataset so we need to load that first.

read_tsv("brain_bodyweight.txt") -> brain
## Parsed with column specification:
## cols(
##   Species = col_character(),
##   Category = col_character(),
##   body = col_double(),
##   brain = col_double(),
##   log.brain = col_double(),
##   log.body = col_double()
## )
brain

Simple scatterplot linear

We’ll start out by doing a simple linear scatterplot (using geom_point) of the linear brain and bodyweight values.

brain %>%
  ggplot(aes(x=brain, y=body)) +
  geom_point()

We can change the colour and size of the points by adding some options to the geom_point call. These are fixed values rather than aesthetic mappings so we don’t need to put them in a call to aes.

brain %>%
  ggplot(aes(x=brain, y=body)) +
  geom_point(color="blue", size=3)

Simple scatterplot log

Since the data look oddly positioned on a linear scale we can re-do the plot using the log transformed values instead.

brain %>%
  ggplot(aes(x=log.brain, y=log.body)) +
  geom_point()

That looks nicer. We can see a relationship between the two data types.

Adding a colour aesthetic

Before we modified the plot by adding in a fixed colour parameter. Here we’re going to add a third aesthetic mapping where we use the colour of the points to represent the Category that the animal falls into.

brain %>%
  ggplot(aes(x=log.brain, y=log.body, color=Category)) +
  geom_point()

Now we can see the three groups coloured differently, and a legend explaining the colours appears on the right.

Exercise 7

We’re going to get into more complex plots here, with more configuration of the static and mapped aesthetics, and the inclusion of more geometry layers.

Gene Barplots

We want to see the lengths of all of the samples in category A in the small data set.

small %>%
  filter(Category == "A") %>%
  ggplot(aes(x=Sample, y=Length)) +
  geom_bar(stat = "identity")

We could also have used geom_col which wouldn’t have required the additional argument.

small %>%
  filter(Category == "A") %>%
  ggplot(aes(x=Sample, y=Length)) +
  geom_col()

Plotting Distributions

We’re going to plot out the distribution of the MutantReadPercent values from child in a couple of different ways. In each case the plot itself is going to summarise the data for us.

We’ll start with a histogram.

child %>%
  ggplot(aes(MutantReadPercent)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can pass options to change the colouring of the bars and the resolution of the histogram.

child %>%
  ggplot(aes(MutantReadPercent)) +
  geom_histogram(binwidth = 5, fill="yellow", color="black")

We can also use geom_density to give a continuous readout of the frequency of values rather than using the binning approach of geom_histogram

child %>%
  ggplot(aes(MutantReadPercent)) +
  geom_density(fill="yellow", color="black")

Frequency barplot

We can also construct a barplot which will summarise our data for us. In the previous barplot example we explicitly set the heights of the bars using an aesthetic mapping. Here we’re going to just pass a set of values to geom_bar and get it to count them and plot out the frequency of each unique value.

We’re going to do this for the REF bases, but only where we have single-letter REFs.

child %>%
  filter(nchar(REF) == 1) %>%
  ggplot(aes(REF)) +
  geom_bar()

If I’d wanted to make this a bit more colourful I could also colour the bars by the REF (so they’ll all be different colours). This would normally create a legend to explain the colours, but because it is self-explanatory in this case I can use the show.legend parameter to suppress it.

child %>%
  filter(nchar(REF) == 1) %>%
  ggplot(aes(REF, fill=REF)) +
  geom_bar(show.legend = FALSE)

Additional Exercises

Adding text

In a previous exercise we drew a plot of the log brainweight vs the log bodyweight. In the scatterplot we drew we couldn’t actually see which animal was represented by which dot, so we’re going to fix that now.

We can add the names to the animals by adding a geom_text layer to the plot. This will use the same x and y aesthetics as the points (so the labels go in the same place), but it will define a new aesthetic mapping of label with the names of the species.

We can add the new aesthetic mapping either in the original ggplot call, or in the geom_text call.

brain %>%
  ggplot(aes(x=log.brain, y=log.body, color=Category, label=Species)) +
  geom_point() + 
  geom_text()

There are a couple of things to fix.

  1. We want the text to be black - we can add a static aesthetic in geom_text to fix this
  2. We want the text to be smaller. Again we can add a size parameter to geom_text for this
  3. We want the text to move off the point. We can do this with hjust
  4. We want a suitable graph title
  5. We want better axis labels.

brain %>%
  ggplot(aes(x=log.brain, y=log.body, color=Category, label=Species)) +
  geom_point(size=3) + 
  geom_text(color="black", size=3, hjust=1.2) + 
  ggtitle("Brainweight vs Bodyweight") +
  xlab("Brain weight (log2 g)") +
  ylab("Body weight (log2 kg)")

Calculating stats

We can see what appears to be a relationship between the weight of the brain and the body, but is it real? We can do a correlation using the cor.test function to check.

The function takes two vectors as arguments, so we need to extract the two relevant columns out of the brain data and pass them to the function.

cor.test(brain$log.body, brain$log.brain)
## 
##  Pearson's product-moment correlation
## 
## data:  brain$log.body and brain$log.brain
## t = 6.0666, df = 25, p-value = 2.44e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5541838 0.8905446
## sample estimates:
##       cor 
## 0.7716832

We can see that the association is significant with p=2.44e-06. We can add this information to the graph.

brain %>%
  ggplot(aes(x=log.brain, y=log.body, color=Category, label=Species)) +
  geom_point(size=3) + 
  geom_text(color="black", size=3, hjust=1.2) + 
  ggtitle("Brainweight vs Bodyweight (corr p=2.44e-06)") +
  xlab("Brain weight (log2 g)") +
  ylab("Body weight (log2 kg)")

Building a linear model

We can also calculate a line of best fit for these two variables. We will do this with the lm function which builds a linear model.

Building models is beyond the scope of this course, but it differs from cor.test in that you pass the whole of the data tibble and then say which parameters you want to use to predict which other parameter.

The model we will build for the brain dataset is: log.body ~ log.brain which will build a model to predict log bodyweight from log brainweight.

lm(data = brain, formula = log.body ~ log.brain)
## 
## Call:
## lm(formula = log.body ~ log.brain, data = brain)
## 
## Coefficients:
## (Intercept)    log.brain  
##      -2.283        1.215

This tells us that the model line which it constructs has a slope of 1.215 and an intercept of -2.283. We can use the geom_abline geometry to add this as a new layer to the existing plot.

brain %>%
  ggplot(aes(x=log.brain, y=log.body, color=Category, label=Species)) +
  geom_point(size=3) + 
  geom_text(color="black", size=3, hjust=1.2) + 
  ggtitle("Brainweight vs Bodyweight (corr p=2.44e-06)") +
  xlab("Brain weight (log2 g)") +
  ylab("Body weight (log2 kg)") +
  geom_abline(slope = 1.215, intercept = -2.283)

We can see that the extinct species have their own little outgroup away from everything else. They have much smaller brains than their bodyweight would predict.
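As an aside, ggplot2 can fit and draw the same regression line itself using geom_smooth with method = "lm", which avoids hard-coding the coefficients we read off the lm output. A sketch, assuming the brain tibble is loaded:

```r
brain %>%
  ggplot(aes(x=log.brain, y=log.body, color=Category)) +
  geom_point(size=3) +
  # geom_smooth fits the linear model and draws the line itself;
  # se=FALSE hides the confidence ribbon, and aes(group=1) forces
  # a single line rather than one line per Category colour.
  geom_smooth(method = "lm", se = FALSE, aes(group = 1), color = "black")
```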