This document contains the worked answers to the exercises.
We will be using the tidyverse, so let's load this first and set a ggplot theme.
library(tidyverse)
theme_set(theme_bw(base_size = 14))
Providing the observed difference between WT and KO cells is of scientific interest, what sample size is needed to achieve 80% power?
power.prop.test(n = NULL, p1 = 10/41, p2 = 14/35,
sig.level = 0.05, power = 0.8,
alternative="two.sided")
##
## Two-sample comparison of proportions power calculation
##
## n = 139.4512
## p1 = 0.2439024
## p2 = 0.4
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
To achieve a power of 80%, you will need a total sample of 280 cells, with 140 in each group.
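Note that power.prop.test() returns a fractional n, so in practice we round up to the next whole cell per group. A minimal sketch, capturing the result first (res is just a working name for illustration):
res <- power.prop.test(p1 = 10/41, p2 = 14/35, sig.level = 0.05, power = 0.8)
ceiling(res$n)      # 140 cells per group
2 * ceiling(res$n)  # 280 cells in total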
What sample size is needed to be able to spot a 10% difference with 80% power?
First, let's calculate the mean, SD, and a 10% difference.
mouse_mean <- mean(c(27.2, 25.5, 26, 29.1))
mouse_sd <- sd(c(27.2, 25.5,26,29.1))
mouse_mean2 <- mouse_mean*0.9
mouse_mean
## [1] 26.95
mouse_mean2
## [1] 24.255
mouse_sd
## [1] 1.601041
Then run the power analysis.
power.t.test(delta = mouse_mean-mouse_mean2,
sd = mouse_sd,
sig.level = 0.05,
power = 0.8,
type = "two.sample",
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 6.652643
## delta = 2.695
## sd = 1.601041
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
To be able to spot a 10% difference in mouse weight at 80% power, assuming the KO group has the same variance as the WT, you will need 7 mice per group.
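As a quick sanity check, we can plug n = 7 back into power.t.test() and confirm that the achieved power is at least 80%; a minimal sketch:
power.t.test(n = 7, delta = mouse_mean - mouse_mean2, sd = mouse_sd,
sig.level = 0.05, type = "two.sample", alternative = "two.sided")$power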
Use the data to calculate the values for a power calculation
First, let's read in the data (spider.data.csv).
spider <- read_csv("Datasets to use/spider.data.csv")
## Rows: 10 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Group
## dbl (2): Participant, Scores
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spider
Now let's calculate what we need for the power analysis.
spider %>%
group_by(Group) %>%
summarise(mean=mean(Scores), sd=sd(Scores))
Finally, let's run the power analysis.
power.t.test(delta = 52 - 39,
sd = 9.75,
sig.level = 0.05,
power = 0.8,
type = "two.sample",
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 9.889068
## delta = 13
## sd = 9.75
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
Provided the preliminary results can be trusted, to pick up such a difference between the two groups with a power of 80% and a significance level of 5%, we will need at least 10 arachnophobes in each group.
How many arachnophobes would you need to achieve a power of 80%, if based on a paired design? For this we want to look at the differences between the two scores, so let's calculate these.
spider %>%
pivot_wider(names_from = Group, values_from = Scores) %>%
mutate(difference = Picture - Real) %>%
summarise(mean = mean(difference), sd = sd(difference))
And run the paired power analysis.
power.t.test(delta = -13,
sd = 5.7,
sig.level = 0.05,
power = 0.8,
type = "paired",
alternative = "two.sided")
##
## Paired t test power calculation
##
## n = 3.782171
## delta = 13
## sd = 5.7
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs
To achieve a power of 80%, based on a paired design, we would need 4 arachnophobes.
Set up the packages we need.
library(tidyverse)
library(rstatix)
library(ggpubr)
library(ggsignif)
theme_set(theme_bw(base_size = 14))
Is there an effect of treatment on the monkeys’ performance?
First, read in the data and calculate the differences
working.memory <- read_csv("Datasets to use/working.memory.csv")
working.memory
working.memory <- working.memory %>%
mutate(difference = DA.depletion - placebo)
working.memory
Let's have a look at our data and check assumptions by producing some plots of the differences.
working.memory %>%
ggplot(aes("DA.Depletion", difference))+
geom_jitter(height=0, width=0.05, size=4, colour="chartreuse3")+
stat_summary(geom="errorbar",fun="mean", fun.min="mean", fun.max="mean", width=0.3, linewidth=1)+
stat_summary(geom="errorbar", fun.data=mean_cl_normal, width=0.15)+
scale_y_continuous(breaks=-16:0, limits=c(-16, 0))+
xlab(NULL)+
ylab("Mean difference +/- 95% CI")
working.memory %>%
ggplot(aes("DA.Depletion", difference))+
geom_boxplot(outlier.shape = NA)+
geom_jitter(height=0, width=0.05, size=4, colour="chartreuse3")+
scale_y_continuous(breaks=-16:0, limits=c(-16, 0))+
xlab(NULL)+
ylab("Mean difference +/- 95% CI")
They look fairly normal, but let's run a formal test and have a look at the QQ plot.
#library(rstatix)
working.memory %>%
shapiro_test(difference)
working.memory %>%
ggplot(aes(sample = difference)) +
stat_qq()+
stat_qq_line()
#library(ggpubr)
ggqqplot(working.memory, "difference")
Our data pass the normality test and the QQ plot looks good, so let's run a paired t-test. There are two possible ways of doing this; the first is a one-sample t-test on the differences:
working.memory %>%
t_test(difference ~ 1, mu=0)
working.memory %>%
t_test(difference ~ 1, mu=0, detailed = TRUE)
Or by doing a paired t-test on a long format of the data:
NOTE: if you do this, make sure to sort by your ID column (using arrange(ID_col); in this example, arrange(Subject)), as a paired t_test assumes the 1st data point in the first group (placebo) corresponds to the 1st data point in the second group (DA.depletion). If your row order is randomised, the samples will be paired incorrectly and the results will be wrong.
working.memory.long <- working.memory %>%
pivot_longer(cols = c(placebo, DA.depletion), names_to = "treatment", values_to = "score")
working.memory.long %>%
arrange(Subject) %>%
t_test(score ~ treatment, paired = TRUE, detailed = TRUE)
Both options give the same result: The injection of a dopamine-depleting agent significantly affects working memory in rhesus monkeys (t=-8.62, df=14, p=5.715e-7).
Now we can produce nice plots of our data to present the results - there are many ways of doing this, but below are a few examples.
working.memory.long %>%
arrange(Subject) %>%
ggplot(aes(x=treatment, y=score, group=Subject))+
geom_line(linewidth=1, colour = "grey")+
geom_point(colour= "black", size = 2) +
scale_y_continuous(breaks=seq(from =0, by=5, to=60),
limits = c(0,60)) +
geom_signif(comparisons = list(c("placebo", "DA.depletion")),
test = "t.test", test.args = list(paired=TRUE),
map_signif_level = TRUE)
working.memory.long %>%
arrange(Subject) %>%
ggline(x = "treatment", y = "score", group = "Subject",
color = "grey", point.color = "treatment", palette = "Dark2",
point.size = 1.5, line.size = 0.4, legend = "none")+
scale_y_continuous(breaks=seq(from =0, by=5, to=60),
limits = c(0,60))+
geom_signif(comparisons = list(c("placebo", "DA.depletion")),
test = "t.test", test.args = list(paired=TRUE),
map_signif_level = TRUE)
working.memory.long %>%
arrange(Subject) %>%
t_test(score ~ treatment, paired = TRUE) -> stat.test
working.memory.long %>%
ggpaired(x = "treatment", y = "score", color = "treatment", id = "Subject",
palette = "Dark2", line.color = "gray", line.size = 0.4, legend = "none")+
scale_y_continuous(breaks=seq(from =0, by=5, to=60),
limits = c(0,60))+
stat_pvalue_manual(stat.test, label=" p = {p}", y.position = 55)
Which beans should you choose for your coffee?
Start by reading in and plotting the data.
coffee <- read_csv("Datasets to use/coffee.bean.species.csv")
coffee
coffee %>%
ggplot(aes(Species,Scores.Total, colour = Species)) +
geom_jitter(height=0, width=0.2, size = 2, show.legend = FALSE) +
stat_summary(geom="crossbar", fun=mean, width=0.6, linewidth=0.3, show.legend = FALSE) +
ylab("Score")
coffee %>%
ggplot(aes(x=Species, y=Scores.Total)) +
geom_boxplot()
coffee %>%
ggplot(aes(x=Species, y=Scores.Total))+
geom_violin(trim=TRUE, linewidth=1, show.legend=FALSE)+
geom_jitter(height=0, width=0.2, size = 2, show.legend = FALSE) +
ylab("Score")+
scale_fill_brewer(palette="Dark2")
Let's have a look at the normality tests.
ggqqplot(coffee, x = "Scores.Total", facet.by = "Species") +
theme_bw()
model <- lm(Scores.Total ~ Species, data = coffee)
# Create a QQ plot of residuals
ggqqplot(residuals(model)) +
theme_bw()
shapiro_test(residuals(model))
coffee %>%
group_by(Species) %>%
shapiro_test(Scores.Total)
coffee %>%
levene_test(Scores.Total ~ Species)
The data do not look perfect, and there is one point outside the confidence band of the residuals QQ plot, but the groups individually look OK and the other tests pass, so we will carry on with a t-test for now.
coffee %>%
t_test(Scores.Total~Species, var.equal = TRUE, detailed = TRUE)
There is a significant difference in taste score between the two coffee beans (t=4.47, df=34, p = 8.36e-05). You should choose the Arabica beans as these get a higher score.
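If you also want an effect size to report alongside the t statistic, rstatix provides cohens_d(), which accepts the same formula interface as t_test(); a minimal sketch:
coffee %>%
cohens_d(Scores.Total ~ Species, var.equal = TRUE)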
Let's produce a final plot of our data.
coffee %>%
t_test(Scores.Total~Species, var.equal = TRUE) -> t_results
coffee %>%
ggplot(aes(Species, Scores.Total)) +
stat_boxplot(geom="errorbar", width=0.2)+
geom_boxplot(outlier.shape = NA)+
geom_jitter(height=0, width=0.1, size = 2, alpha = 0.5, colour="red")+
ylab("Score")+
geom_signif(comparisons = list(c("Arabica", "Robusta")), annotations = t_results$p)
Set up the packages we need.
library(tidyverse)
library(rstatix)
library(ggpubr)
library(ggsignif)
theme_set(theme_bw(base_size = 14))
Read in the data and do some initial plots. We have to force the fertiliser type to be a factor or character, as it gets automatically detected as a “double” (numeric), when it is actually a categorical variable.
crop <- read_csv("Datasets to use/crop.yield.csv")
crop <- crop %>%
mutate(fertilizer = as.factor(fertilizer))
crop
crop %>%
ggplot(aes(x=fertilizer, y=yield, colour=fertilizer))+
geom_jitter(height=0, width=0.3, size=3)+
stat_summary(geom="crossbar", fun=mean, width=0.6, linewidth=0.3, show.legend = FALSE)
crop %>%
ggplot(aes(x=fertilizer, y=yield, fill=fertilizer))+
geom_boxplot(show.legend = FALSE)
crop %>%
ggplot(aes(x=fertilizer, y=yield, colour=fertilizer))+
geom_violin(show.legend = FALSE)+
geom_jitter(height=0, width=0.25, alpha=0.5, size=3, show.legend = FALSE)
Let's have a look at the QQ plot and normality tests.
model <- aov(yield ~ fertilizer, data = crop)
ggqqplot(residuals(model)) +
theme_bw()
shapiro_test(residuals(model))
#Or on each group
crop %>%
group_by(fertilizer) %>%
shapiro_test(yield)
## Second assumption with Levene
crop %>%
levene_test(yield ~ fertilizer)
We look to be meeting all assumptions required for the ANOVA, so we will carry out the test.
# omnibus step #
crop %>%
anova_test(yield~fertilizer, detailed=TRUE)
# Tukey #
crop %>%
tukey_hsd(yield~fertilizer)
After running the ANOVA, we can see there is a significant difference between the three fertilisers from the omnibus test (F=7.86, p=0.0007), so we can reject the null hypothesis that there is no difference. From the multiple comparisons test, we can see that fertiliser 3 is significantly different from both 1 and 2 (p=0.0006 and p=0.0209, respectively), but there is no significant difference between fertilisers 1 and 2 (p=0.495). Therefore, we would want to use fertiliser 3 to maximise yield.
Finally, let's plot our results.
crop %>%
tukey_hsd(yield~fertilizer)%>%
mutate(comparison = paste(group1, sep="-", group2)) -> tukey.conf
tukey.conf %>%
ggplot(aes(x=comparison, y=estimate, ymin=conf.low, ymax=conf.high)) +
geom_errorbar(colour="black", linewidth=1)+
geom_point(size=3, colour="darkred")+
geom_hline(yintercept=0, linetype="dashed", color = "red")+
coord_flip()
sig.comp <- tukey.conf %>%
filter(p.adj<0.05)
crop %>%
ggplot(aes(x=fertilizer, y=yield, colour=fertilizer)) +
geom_boxplot(show.legend = FALSE)+
geom_jitter(height=0, width=0.1, alpha=0.5, size=5, show.legend = FALSE)+
geom_signif(comparisons = list(c("1","3"),
c("2","3")),
annotations = sig.comp$p.adj,
y_position = c(179.3, 179.7),
colour = "black",
show.legend = FALSE)
crop %>%
ggplot(aes(x=fertilizer, y=yield, colour=fertilizer)) +
geom_boxplot(show.legend = FALSE)+
geom_jitter(height=0, width=0.1, alpha=0.5, size=5, show.legend = FALSE)+
stat_pwc(method = "tukey_hsd", label = "p.adj.signif", hide.ns = TRUE, show.legend = FALSE)
Is there a difference between KO with/without treatment and WT?
Start by reading in the data and converting Condition to a factor, specifying the levels so that WT is the first group.
neutrophils.long <-read_csv("Datasets to use/neutrophils.long.csv")
neutrophils.long %>%
mutate(Condition=factor(Condition, levels = c("WT", "KO", "KO+T1", "KO+T2"))) -> neutrophils.long
neutrophils.long
Then plot the data, ideally showing the matching by experiment, so we can see how this looks across the different conditions.
neutrophils.long %>%
ggplot(aes(x=Condition, y=Values, group=Experiment, colour=Experiment, fill=Experiment))+
geom_line(linewidth=2)+
geom_point(size=3, shape = 21, colour= "black", stroke=2)
neutrophils.long %>%
ggplot(aes(x=Condition, y=Values))+
geom_boxplot(outlier.shape = NA)+
geom_jitter(height = 0, width = 0.2, size=3)
neutrophils.long %>%
ggplot(aes(x=Condition, y=Values))+
geom_boxplot(outlier.shape = NA)+
geom_jitter(aes(x=Condition, y=Values, colour=Experiment), height = 0, width = 0.2, size=4, alpha=0.7)
Now let's look at the normality tests and QQ plots.
neutrophils.long %>%
group_by(Condition) %>%
shapiro_test(Values)
ggqqplot(neutrophils.long, "Values", facet.by = "Condition")
We can also build a model to test the whole dataset together and look at the residuals; however, this requires some additions to the formula given to aov(), and selecting the correct part of the resulting model. We need to include an error term for the Experiment, added to the formula as below. The model produced contains two strata: "Stratum 1", containing information on the contribution of the experiment, and "Stratum 2", containing the differences between the conditions "Within" the experiments, which is what we are interested in.
This is getting towards more advanced model building, which is beyond the scope of this course, so do not worry too much about this part. If in doubt, use the above approach and look at each group individually.
model <- aov(Values~Condition+Error(factor(Experiment)), data = neutrophils.long)
model
##
## Call:
## aov(formula = Values ~ Condition + Error(factor(Experiment)),
## data = neutrophils.long)
##
## Grand Mean: 63.55
##
## Stratum 1: factor(Experiment)
##
## Terms:
## Residuals
## Sum of Squares 4208.7
## Deg. of Freedom 4
##
## Residual standard error: 32.43725
##
## Stratum 2: Within
##
## Terms:
## Condition Residuals
## Sum of Squares 10947.75 1532.50
## Deg. of Freedom 3 12
##
## Residual standard error: 11.30081
## Estimated effects may be unbalanced
ggqqplot(residuals(model$Within)) +
theme_bw()
So overall the data don't look amazing on these tests, but each group passes the normality test and the QQ plots look OK, if slightly borderline, so we will continue with our ANOVA.
Do the omnibus and post-hoc tests. For the omnibus test, we need to include the experiment as part of the analysis, so we use a slightly different approach to inputting to anova_test(), as below.
We again need to specify that there is pairing across the conditions when doing post-hoc tests (remembering to sort by our ID, so the correct pairings are inferred), and in this case we are interested in comparing to the WT group, so we include it as the ref.group.
neutrophils.long %>%
anova_test(dv = Values, wid = Experiment, within = Condition) -> res.aov
res.aov
## ANOVA Table (type III tests)
##
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Condition 3 12 28.575 9.51e-06 * 0.656
##
## $`Mauchly's Test for Sphericity`
## Effect W p p<.05
## 1 Condition 0.121 0.36
##
## $`Sphericity Corrections`
## Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF] p[HF]
## 1 Condition 0.692 2.07, 8.3 0.000179 * 1.45 4.35, 17.4 9.51e-06
## p[HF]<.05
## 1 *
get_anova_table(res.aov)
# post-hoc test: bonferroni #
neutrophils.long %>%
arrange(Experiment) %>%
pairwise_t_test(Values~Condition, paired=TRUE, ref.group = "WT", p.adjust.method = "bonferroni")
# post-hoc test: holm #
neutrophils.long %>%
arrange(Experiment) %>%
pairwise_t_test(Values~Condition, paired=TRUE, ref.group = "WT", p.adjust.method = "holm")
Put it together into a final plot - we can look at the differences from WT and/or show the linkages.
neutrophils.long %>%
group_by(Experiment) %>%
mutate(Difference=Values-Values[Condition=="WT"]) %>%
ungroup() -> neutrophils.long
neutrophils.long %>%
filter(Condition !="WT") %>%
select(-Values) -> neutrophils.differences
neutrophils.differences %>%
ggplot(aes(x=Condition, y=Difference, fill=Condition)) +
geom_bar(stat = "summary", fun="mean", colour="black", show.legend = FALSE)+
stat_summary(geom="errorbar", fun.data=mean_cl_normal, width=0.15)+
geom_jitter(aes(x=Condition, y=Difference, colour=Experiment), height=0, width=0.1, size=4, show.legend=FALSE)+
ylab("Mean difference from WT with 95% CI")+
scale_y_continuous(breaks=seq(from=-40, by=10, to=80))+
scale_fill_brewer(palette = "PuOr")
neutrophils.long %>%
arrange(Experiment) %>%
pairwise_t_test(Values~Condition, paired=TRUE, ref.group = "WT", p.adjust.method = "holm") %>%
add_xy_position(x="Condition") -> neutro.results
neutrophils.long %>%
ggplot(aes(x=Condition, y=Values))+
geom_boxplot(outlier.shape = NA)+
geom_jitter(aes(x=Condition, y=Values, colour=Experiment), height = 0, width = 0.2, size=4, alpha=0.5)+
stat_pvalue_manual(neutro.results, label = "p = {p.adj}", label.size=5)+
scale_y_continuous(breaks=seq(from=0, by=20, to=180))
results.sig <- neutro.results %>%
filter(p.adj<0.05)
neutrophils.long %>%
ggplot(aes(x=Condition, y=Values, group=Experiment, colour=Experiment, fill=Experiment))+
geom_line(linewidth=2)+
geom_point(size=3, shape = 21, colour= "black", stroke=2) +
geom_signif(comparisons = list(c("WT","KO"),
c("WT","KO+T2")),
annotations = results.sig$p.adj,
y_position = c(132, 140),
colour = "black",
show.legend = FALSE)
Set up the packages we need.
library(tidyverse)
library(rstatix)
library(gglm)
theme_set(theme_bw(base_size = 14))
Is there a relationship between parasite burden and body mass in roe deer?
Read in the data and plot.
deer <- read_csv("Datasets to use/Roe.deer.csv")
deer
deer %>%
ggplot(aes(PL, BM, colour = sex)) +
geom_point(size=3) +
xlab("Parasite buden") +
ylab("Body mass")
Let's create a line of best fit for each sex and add them to the graph.
deer.female <- deer %>%
filter(sex == "F")
lm(BM ~ PL, data=deer.female)-> fit.deer.female
coefficients(fit.deer.female) -> coef.deer.female
coef.deer.female
## (Intercept) PL
## 25.041029 -1.889378
deer.male <- deer %>%
filter(sex == "M")
lm(BM ~ PL, data=deer.male)-> fit.deer.male
coefficients(fit.deer.male) -> coef.deer.male
coef.deer.male
## (Intercept) PL
## 30.204398 -4.621496
deer %>%
ggplot(aes(PL, BM, colour = sex)) +
geom_point(size=3) +
geom_abline(intercept = coef.deer.female[1], slope = coef.deer.female[2], colour = "orange2", linewidth=1) +
geom_abline(intercept=coef.deer.male[1],slope=coef.deer.male[2], colour = "purple2", linewidth=1) +
scale_colour_manual(values=c("orange2", "purple2"))+
xlab("Parasite buden") +
ylab("Body mass")
How good are our models? Let's calculate Cook's distance and the standardised residuals, and produce the diagnostic plots.
deer.female$cook <- cooks.distance(fit.deer.female)
deer.female$residual <- rstandard(fit.deer.female)
deer.female %>%
arrange(desc(cook)) %>%
head(5)
deer.female %>%
arrange(desc(abs(residual))) %>%
head(5)
par(mfrow=c(2,2))
plot(fit.deer.female)
deer.male$cook <- cooks.distance(fit.deer.male)
deer.male$residual <- rstandard(fit.deer.male)
deer.male %>%
arrange(desc(cook)) %>%
head(5)
deer.male %>%
arrange(desc(abs(residual))) %>%
head(5)
par(mfrow=c(2,2))
plot(fit.deer.male)
There aren't any points that cause concern based on Cook's distance or the residuals, and the QQ plots look OK, so we will carry on with our correlation.
deer %>%
group_by(sex) %>%
cor_test(PL, BM)
There is a negative correlation between parasite load and body mass, but this relationship is only significant for the males (r = -0.75, p = 0.0049 vs. females: r = -0.30, p = 0.294).
Read in and plot the data
read_csv("Datasets to use/exam.anxiety.csv") -> exam.anxiety
exam.anxiety
exam.anxiety %>%
ggplot(aes(x=Revise, y=Anxiety, colour=Gender)) + geom_point(size=3)
Create fits for males and females separately and plot.
exam.anxiety %>%
filter(Gender=="Female") -> exam.anxiety.female
lm(Anxiety~Revise, data=exam.anxiety.female) -> fit.female
coefficients(fit.female) -> cf.fit.female
exam.anxiety %>%
filter(Gender=="Male") -> exam.anxiety.male
lm(Anxiety~Revise, data=exam.anxiety.male) -> fit.male
coefficients(fit.male) -> cf.fit.male
exam.anxiety %>%
ggplot(aes(x=Revise, y=Anxiety, colour=Gender, label=Code))+
geom_point(size=2)+
geom_abline(intercept=cf.fit.male[1], slope=cf.fit.male[2], colour="purple", linewidth=1)+
geom_abline(intercept=cf.fit.female[1], slope=cf.fit.female[2], colour="orange", linewidth=1)+
scale_colour_manual(values=c("orange", "purple"))
It looks like there might be a couple of outliers in the dataset - let's add labels so we can see which ones they are.
exam.anxiety %>%
ggplot(aes(x=Revise, y=Anxiety, colour=Gender, label=Code))+
geom_point(size=2)+
geom_abline(intercept=cf.fit.male[1], slope=cf.fit.male[2], colour="purple", linewidth=1)+
geom_abline(intercept=cf.fit.female[1], slope=cf.fit.female[2], colour="orange", linewidth=1)+
scale_colour_manual(values=c("orange", "purple"))+
geom_text(nudge_x = 3, size=4, show.legend = FALSE)
Let's have a look at the diagnostic plots.
par(mfrow=c(2,2))
plot(fit.female)
plot(fit.male)
As suspected, there are definitely some problematic-looking points. In females, it looks like there are two potential outliers (43 and 11 on the diagnostic plots), and in males there looks to be one outlier that is very far from the rest of the data (40).
The numbers in the diagnostic plots are the row numbers of the points in the input data, so we can find the corresponding IDs in the original data by extracting those rows.
exam.anxiety.female[c(11, 43),]
exam.anxiety.male[40,]
So in females these correspond to 87 and 24 in the original dataset, and in the males it corresponds to 78.
Alternatively, if we set the rownames of the input data to the ID (Code), the values shown on the diagnostic plots will match the IDs in the original data:
exam.anxiety %>%
filter(Gender=="Female") %>%
column_to_rownames("Code") -> exam.anxiety.id.female
lm(Anxiety~Revise, data=exam.anxiety.id.female) -> fit.id.female
exam.anxiety %>%
filter(Gender=="Male") %>%
column_to_rownames("Code") -> exam.anxiety.id.male
lm(Anxiety~Revise, data=exam.anxiety.id.male) -> fit.id.male
par(mfrow=c(2,2))
plot(fit.id.female)
plot(fit.id.male)
So now we can directly see that the outlying points are 87 and 24 in females and 78 in males. This also works if you have text labels for your points.
Now that we have identified these points, let's have a look at how good the existing model is and consider whether to remove them.
summary(fit.male)
##
## Call:
## lm(formula = Anxiety ~ Revise, data = exam.anxiety.male)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.124 -2.900 2.221 6.750 16.600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.1941 2.6213 32.119 < 2e-16 ***
## Revise -0.5353 0.1016 -5.267 2.94e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.3 on 50 degrees of freedom
## Multiple R-squared: 0.3568, Adjusted R-squared: 0.344
## F-statistic: 27.74 on 1 and 50 DF, p-value: 2.937e-06
summary(fit.female)
##
## Call:
## lm(formula = Anxiety ~ Revise, data = exam.anxiety.female)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.687 -6.263 -1.204 4.197 38.628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.94181 2.27858 40.35 < 2e-16 ***
## Revise -0.82380 0.08173 -10.08 1.54e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.42 on 49 degrees of freedom
## Multiple R-squared: 0.6746, Adjusted R-squared: 0.668
## F-statistic: 101.6 on 1 and 49 DF, p-value: 1.544e-13
exam.anxiety %>%
group_by(Gender) %>%
cor_test(Revise, Anxiety)
We can see that, based on all the data including outliers, we do not get a very good fit for males (r=-0.6, r2=0.3568), and we could potentially also improve the fit for females (r2=0.6746). Let's have a look at removing our influential outliers.
First, focus on the males: calculate the standardised residuals and Cook's distance so we can filter on them.
cooks.distance(fit.male) -> cook.m
rstandard(fit.male) -> st.resid.m
exam.anxiety.male %>%
add_column(st.resid.m) %>%
add_column(cook.m) -> exam.anxiety.male
exam.anxiety.male %>%
arrange(desc(abs(st.resid.m)))
exam.anxiety.male %>%
filter(abs(st.resid.m)<3) -> exam.anxiety.male.78
78 has a standardised residual below -3, so we remove this point. Let's fit another model without 78 and see if this has improved things.
lm(Anxiety~Revise, data=exam.anxiety.male.78) -> fit.male.78
summary(fit.male.78)
##
## Call:
## lm(formula = Anxiety ~ Revise, data = exam.anxiety.male.78)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.0296 -3.8704 0.5626 6.0786 14.2525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.97461 1.64755 52.790 < 2e-16 ***
## Revise -0.60752 0.06326 -9.603 7.59e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.213 on 49 degrees of freedom
## Multiple R-squared: 0.653, Adjusted R-squared: 0.6459
## F-statistic: 92.22 on 1 and 49 DF, p-value: 7.591e-13
exam.anxiety.male.78 %>%
cor_test(Revise, Anxiety)
par(mfrow=c(2,2))
plot(fit.male.78)
The new fit is a big improvement after removal of 78, now r=-0.81 and r2=0.653 (up from r=-0.6, r2=0.3568). The diagnostic plots look a lot more reasonable now too.
Let's move on to the female data - first we will try filtering using the standardised residuals, as we did for the males.
cooks.distance(fit.female) -> cook.f
rstandard(fit.female) -> st.resid.f
exam.anxiety.female %>%
add_column(st.resid.f) %>%
add_column(cook.f) -> exam.anxiety.female
exam.anxiety.female %>%
arrange(desc(abs(st.resid.f)))
exam.anxiety.female %>%
filter(abs(st.resid.f)<3) -> exam.anxiety.female.87
This removes 87 - let's have a look at the fit now.
lm(Anxiety~Revise, data=exam.anxiety.female.87) -> fit.female.87
summary(fit.female.87)
##
## Call:
## lm(formula = Anxiety ~ Revise, data = exam.anxiety.female.87)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.7518 -5.7069 -0.7782 3.2117 18.5538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92.24536 1.93591 47.65 <2e-16 ***
## Revise -0.87504 0.07033 -12.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.849 on 48 degrees of freedom
## Multiple R-squared: 0.7633, Adjusted R-squared: 0.7584
## F-statistic: 154.8 on 1 and 48 DF, p-value: < 2.2e-16
exam.anxiety.female.87 %>%
cor_test(Revise, Anxiety)
par(mfrow=c(2,2))
plot(fit.female.87)
This has slightly improved the model: r=-0.87, r2=0.7633 (from r=-0.82, r2=0.6746). However, from the diagnostic plots we can see that 11 (24 in the dataset) has a Cook's distance greater than 1, so let's have a look at removing this point too.
exam.anxiety.female.87 %>%
arrange(desc(cook.f))
exam.anxiety.female.87 %>%
filter(cook.f<1) -> exam.anxiety.female.87.24
lm(Anxiety~Revise, data=exam.anxiety.female.87.24) -> fit.female.87.24
summary(fit.female.87.24)
##
## Call:
## lm(formula = Anxiety ~ Revise, data = exam.anxiety.female.87.24)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4485 -5.9633 -0.6506 4.4159 17.3276
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.61640 1.93047 46.94 < 2e-16 ***
## Revise -0.77307 0.07697 -10.04 2.78e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.354 on 47 degrees of freedom
## Multiple R-squared: 0.6822, Adjusted R-squared: 0.6754
## F-statistic: 100.9 on 1 and 47 DF, p-value: 2.78e-13
exam.anxiety.female.87.24 %>%
cor_test(Revise, Anxiety)
par(mfrow=c(2,2))
plot(fit.female.87.24)
Removing 24 has actually made the model slightly worse, with r=-0.83, r2=0.6822, down from r=-0.87, r2=0.7633. So, if we want to select the model that is the best fit for the data, we will stick with removal of 87 but keep 24.
We can check our models another way using the Shapiro normality test.
shapiro_test(st.resid.m)
shapiro_test(st.resid.f)
exam.anxiety.male.78 %>%
shapiro_test(st.resid.m)
exam.anxiety.female.87 %>%
shapiro_test(st.resid.f)
Before removal of the outliers, our data were failing normality tests, but this improves after outlier removal. That being said, the male data still fail this test, so this is something to keep in mind.

Another thing to note is that deciding how to deal with outliers can be tricky, especially when it is not your own data and you do not have the context behind it. You should decide how you will deal with outliers before seeing your data, and be consistent across experiments (i.e. do not delete them when you get a significant result without them and leave them in when you get a significant result with them). How to deal with them also depends on your end goal: if you want to create the best model for predicting future data, you might want to remove outliers to increase model fit for the bulk of the data; however, if you are looking at the correlation/relationship between your variables, then outliers might be real data points that shouldn't just be ignored. In this case, we will carry on with the rest of the analysis based on removing the two outliers.
Finally, we can join together our males and females and see if there are differences between them, plus plot a final graph of the results.
bind_rows(exam.anxiety.female.87, exam.anxiety.male.78) -> exam.anxiety.87.78
lm(Anxiety ~ Revise + Gender + Revise*Gender, data=exam.anxiety.87.78) -> fit.genders
summary(fit.genders)
##
## Call:
## lm(formula = Anxiety ~ Revise + Gender + Revise * Gender, data = exam.anxiety.87.78)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.0296 -5.6022 -0.3294 5.6091 18.5538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92.24536 1.86694 49.410 < 2e-16 ***
## Revise -0.87504 0.06783 -12.901 < 2e-16 ***
## GenderMale -5.27075 2.53296 -2.081 0.04008 *
## Revise:GenderMale 0.26752 0.09445 2.832 0.00562 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.534 on 97 degrees of freedom
## Multiple R-squared: 0.7228, Adjusted R-squared: 0.7142
## F-statistic: 84.32 on 3 and 97 DF, p-value: < 2.2e-16
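One way to read the interaction term: the male slope in the combined model is the Revise estimate plus the Revise:GenderMale offset, which recovers the slope from the male-only fit. A quick check:
cf <- coefficients(fit.genders)
cf["Revise"] + cf["Revise:GenderMale"]  # -0.60752, the Revise slope from fit.male.78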
We can conclude that, among male and female students, there is a significant relationship between time spent revising for an exam and anxiety (males: r=-0.81, p=7.59e-13; females: r=-0.87, p<2e-16). Furthermore, that relationship is significantly stronger in females than in males (p=0.0056).
Finally, let's produce a plot of our results.
coefficients(fit.male.78) -> cf.fit.male.78
coefficients(fit.female.87) -> cf.fit.female.87
exam.anxiety.87.78 %>%
ggplot(aes(Revise, Anxiety, colour=Gender))+geom_point(size=4)+
geom_abline(intercept=cf.fit.male.78[1], slope=cf.fit.male.78[2], colour="purple", linewidth=1)+
geom_abline(intercept=cf.fit.female.87[1], slope=cf.fit.female.87[2], colour="orange", linewidth=1)+
scale_colour_manual(values = c("orange", "purple"))+
scale_x_continuous(breaks=seq(from=0, by=10, to=100))+
scale_y_continuous(breaks=seq(from=0, by=10, to=100))+
annotate("text", label = "Females: r = -0.87", x = 50, y = 30, size = 6, colour = "orange")+
annotate("text", label = "Males: r = -0.81", x = 80, y = 50, size = 6, colour = "purple")+
annotate("text", label = "Interaction: Revise*Gender: p=0.00562", x = 40, y = 0, size = 6, colour = "black")+
ggtitle("Relationship between time spent revising and anxiety in male and female students")
Set up the packages we need.
library(tidyverse)
library(ggpubr)
library(rstatix)
library(ggsignif)
theme_set(theme_bw(base_size = 14))
Can Cambridge students tell the difference between T-shirts from Oxford or Cambridge?
Read in and plot the data.
read_csv("Datasets to use/smelly.teeshirt.csv") -> smelly.teeshirt
smelly.teeshirt %>%
ggplot(aes(x=university, y=smell, fill=university))+
geom_boxplot(alpha=0.2, show.legend = FALSE, outlier.shape = NA)+
geom_jitter(height=0, width=0.1, size=4, colour="firebrick4", alpha = 0.8, show.legend = FALSE)+
scale_fill_brewer(palette="Dark2")+
scale_y_continuous(breaks=seq(from=0, by=1, to=8), limits = c(0, 8))
As we are using a scoring system, this fails the assumption of linearity (or interval data) - that the distance between points of the scale should be equal at all parts along the scale - so we will use a non-parametric test for our analysis.
Because this assumption is not met, we will use a non-parametric test, which means our data do not need to meet the parametric assumptions; it is therefore not necessary to run those checks, and we can go straight to the test after having a look at our data.
smelly.teeshirt %>%
wilcox_test(smell~university)
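If an effect size is wanted to accompany the test, rstatix also provides wilcox_effsize(), which can be piped in the same way; a minimal sketch:
smelly.teeshirt %>%
wilcox_effsize(smell ~ university)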
Based on these data, we conclude that Cambridge students can tell the difference between t-shirts from Oxford and Cambridge students, based on smell (U = 5, p = 0.00479).
However, there is a significant flaw in the design of the experiment: the t-shirts had logos on, meaning the study is not blinded and is subject to bias. It therefore seems likely that the students rated the Oxford t-shirts worse because they had the logo on, rather than anything to do with the smell. A better-designed study would have used the same t-shirts for both Oxford and Cambridge students, and could have used a paired design to better account for variability in scoring between individuals.
Nevertheless, let's plot our results so we get a final figure.
smelly.teeshirt %>%
wilcox_test(smell~university) %>%
add_xy_position()-> smelly.test
smelly.teeshirt %>%
ggplot(aes(x=university, y=smell))+
geom_boxplot(aes(fill=university), alpha=0.2, show.legend = FALSE)+
geom_jitter(height=0, width=0.1, size=4, colour="firebrick4", alpha = 0.8, show.legend = FALSE)+
scale_fill_brewer(palette="Dark2")+
scale_y_continuous(breaks=seq(from=0, by=1, to=8), limits = c(0, 8))+
stat_pvalue_manual(smelly.test, label="Mann-Whitney test, p = {p}", size=5)
smelly.teeshirt %>%
ggplot(aes(x=university, y=smell))+
geom_boxplot(aes(fill=university), alpha=0.2, show.legend = FALSE)+
geom_jitter(height=0, width=0.1, size=4, colour="firebrick4", alpha = 0.8, show.legend = FALSE)+
scale_fill_brewer(palette="Dark2")+
scale_y_continuous(breaks=seq(from=0, by=1, to=8), limits = c(0, 8))+
geom_signif(comparisons = list(c("Cambridge", "Oxford")), test = "wilcox.test",
map_signif_level = TRUE, textsize = 8)
Do botulinum toxin injections reduce muscle spasticity levels?
Read in and have an initial look at the data and the differences.
read_csv("Datasets to use/botulinum.long.csv") -> botulinum
botulinum %>%
mutate(treatment = factor(treatment, levels = c("before", "after"))) -> botulinum
botulinum %>%
ggplot(aes(x=treatment, y=scores))+
geom_boxplot(outlier.shape = NA)+
geom_jitter(height=0, width=0.1) +
scale_x_discrete(limits =c("before", "after"))
botulinum %>%
group_by(id) %>%
mutate(difference = scores - scores[treatment == 'before']) %>%
filter(treatment == "after") %>%
ggplot(aes("Difference", difference))+
geom_jitter(height = 0, width=0.1, size=5, alpha=0.5)+
stat_summary(geom="errorbar", fun=median,fun.min=median,fun.max=median, width = 0.5)+
scale_y_continuous(breaks=seq(from=-7, by=1, to=0), limits = c(-7, 0))+
xlab(NULL)
botulinum %>%
ggpaired(x = "treatment",
y = "scores",
color = "treatment",
id = "id",
palette = "Dark2",
line.color = "gray",
line.size = 0.4,
xlab = "Treatment",
ylab = "Scores", legend = "none")+
scale_y_continuous(breaks=seq(from =0, by=1, to=10), limits = c(0,10))
Again, this is based on a scoring system and would not meet the assumption of linearity, so we will go straight to our non-parametric test.
Run the Wilcoxon paired (signed-rank) test.
botulinum %>%
arrange(id) %>%
wilcox_test(scores~treatment, paired = TRUE)
Answer: There is a significant difference pre- and post- treatment in ratings of muscle spasticity (W = 45, p = 0.00826).
Produce the final figure
botulinum %>%
arrange(id) %>%
wilcox_test(scores~treatment, paired = TRUE) -> bot.test
botulinum %>%
ggpaired(x = "treatment",
y = "scores",
color = "treatment",
id = "id",
palette = "Dark2",
line.color = "gray",
line.size = 0.4,
xlab = "Treatment",
ylab = "Scores", legend = "none")+
scale_y_continuous(breaks=seq(from =0, by=1, to=12), limits = c(0,12))+
stat_pvalue_manual(bot.test, label="Wilcoxon's signed-rank test
p = {p}", y.position = 11)
Does the average weight gain depend on the creatine group to which people were assigned?
Read in and plot the data
read_csv("Datasets to use/creatine.csv") -> creatine
creatine %>%
ggplot(aes(creatine, gain, colour=creatine))+
geom_jitter(height = 0, width=0.1, size=4, show.legend = FALSE)
creatine %>%
ggplot(aes(creatine, gain, fill=creatine))+
geom_boxplot(show.legend = FALSE)+
scale_fill_brewer(palette="Oranges")
The data do not look very normally distributed, and there look to be large differences in the variability across the groups - let's have a look at a QQ plot, normality tests, and Levene's test.
model <- aov(gain~creatine, data = creatine)
ggqqplot(residuals(model)) +
theme_bw()
ggqqplot(creatine, x = "gain", facet.by = "creatine")
shapiro_test(residuals(model))
creatine %>%
group_by(creatine) %>%
shapiro_test(gain)
creatine %>%
levene_test(gain ~ creatine)
The Shapiro test is highly significant and there are a couple of points outside the QQ plot confidence bands.
We may want to test for outliers in this dataset, to see which points are identified
creatine %>%
group_by(creatine) %>%
identify_outliers(gain)
We can see that two values in the "once" group are identified as extreme outliers, as well as one value in the "no" group. We could look to remove these, but that is a relatively high proportion of our data to exclude without a very good reason to do so (20% of the data). It therefore seems more reasonable to leave them in but use a non-parametric approach.
Run the Kruskal-Wallis and post-hoc Dunn's tests.
creatine %>%
kruskal_test(gain~creatine)
creatine %>%
dunn_test(gain~creatine)
Answer: This study did not demonstrate any effect of creatine on weight gain (H = 3.87, p = 0.145).
It may be that the very high point in the "once" group is actually an implausible amount of gain, and therefore we think it is a data error. Assuming this is the case, let's repeat our analysis after removing this point.
creatine.outlier <- creatine %>%
filter(gain<4000)
creatine.outlier %>%
ggplot(aes(creatine, gain, fill=creatine))+
geom_boxplot(outlier.shape = NA, show.legend = FALSE)+
geom_jitter(height = 0, width=0.1, size=4, alpha=0.5, show.legend = FALSE)+
scale_fill_brewer(palette="Oranges")
model <- aov(gain~creatine, data = creatine.outlier)
ggqqplot(residuals(model)) +
theme_bw()
ggqqplot(creatine.outlier, x = "gain", facet.by = "creatine")
shapiro_test(residuals(model))
creatine.outlier %>%
group_by(creatine) %>%
shapiro_test(gain)
creatine.outlier %>%
levene_test(gain ~ creatine)
Even after removal of the extreme outlier in the once group, the data still do not pass normality tests and also fail the assumption of homogeneity of variance, so we will stick with a non-parametric approach.
creatine.outlier %>%
kruskal_test(gain~creatine)
creatine.outlier %>%
dunn_test(gain~creatine)
After removing the outlier we get a significant result on the omnibus part of the test (H=6.0, p=0.0498) but none of the pairwise comparisons reach significance.
Which violin is the best according to the 10 violinists?
Read in and plot the data
read_csv("Datasets to use/violin.csv") -> violin
violin %>%
ggplot(aes(Violin, Score)) +
geom_violin()+
geom_point(aes(colour=Rater)) +
geom_line(aes(group = Rater, colour=Rater))
This is based on a scoring system, so we will use a non-parametric approach, meaning we can go straight to the test.
violin %>%
friedman_test(Score ~ Violin|Rater)
violin %>%
arrange(Rater) %>%
wilcox_test(Score ~ Violin, paired = TRUE, p.adjust.method = "bonferroni")
There is a significant difference between the three violins (p=0.0053), with violin A being ranked higher than violin C (p=0.017).
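rstatix also offers friedman_effsize(), which computes Kendall's W as an effect size for the Friedman test using the same formula interface; a minimal sketch:
violin %>%
friedman_effsize(Score ~ Violin | Rater)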
violin %>%
arrange(Rater) %>%
wilcox_test(Score ~ Violin, paired = TRUE, p.adjust.method = "bonferroni") %>%
add_xy_position(step.increase = 0.4) %>%
mutate(y.position = y.position + 2)-> violin.test
violin %>%
ggplot(aes(Violin, Score)) +
geom_violin(trim=FALSE, aes(fill=Violin), alpha = 0.5, show.legend = FALSE)+
geom_point(aes(colour=Rater), size=4) +
geom_line(aes(group = Rater, colour=Rater), linewidth=1) +
scale_y_continuous(breaks=seq(from =0, by=1, to=14), limits = c(0,14))+
scale_fill_brewer(palette="Pastel1")+
stat_pvalue_manual(violin.test, label = "p = {p.adj}")
results.sig <- violin.test %>%
filter(p.adj<0.05)
violin %>%
ggplot(aes(Violin, Score)) +
geom_violin(aes(fill=Violin), alpha = 0.5, show.legend = FALSE)+
geom_line(aes(group = Rater), linewidth=0.8, colour = "grey") +
geom_point(aes(colour=Rater), size=4) +
scale_y_continuous(breaks=seq(from =2, by=1, to=11), limits = c(2,11))+
scale_fill_brewer(palette="Accent")+
geom_signif(comparisons = list(c("Violin A","Violin C")),
annotations = results.sig$p.adj,
y_position = c(10, 11),
colour = "black",
show.legend = FALSE)
Is social dominance associated with parasitism?
Read in and plot the data
read_csv("Datasets to use/dominance.csv") -> dominance
dominance %>%
ggplot(aes(rank, eggs.per.gram))+
geom_col(fill="midnightblue", colour="black", size=1)+
scale_x_continuous(breaks=seq(1:6))+
scale_y_continuous(breaks = seq(0, 6000, 1000))
dominance %>%
ggplot(aes(rank, eggs.per.gram))+
geom_line()+
scale_x_continuous(breaks=1:6)+
scale_y_continuous(breaks = seq(0, 6000, 1000))
It looks like there probably is a relationship, so let's run a Spearman rank correlation test.
dominance %>%
cor_test(rank,eggs.per.gram, method = "spearman")
Answer: the relationship between dominance and parasitism is significant (ρ = -0.94, p = 0.0167), with high-ranking males harbouring a heavier burden.
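For reference, cor_test() here is a wrapper around base R's cor.test(), so the same result can be obtained directly; a minimal sketch:
cor.test(dominance$rank, dominance$eggs.per.gram, method = "spearman")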
dominance %>%
ggplot(aes(rank, eggs.per.gram))+
geom_col(fill="midnightblue", colour="black", size=1)+
scale_x_continuous(breaks=seq(1:6))+
scale_y_continuous(breaks = seq(0, 6000, 1000))+
annotate("text", label = "rho=-0.94, p=0.017", x=5, y=5500, size=8)+
ggtitle("Association between dominance and parasitism")
Set up the packages we need.
library(tidyverse)
library(rstatix)
library(ggpubr)
theme_set(theme_bw(base_size = 14))
Is there an association between reward and dancing in dogs?
Read in and have a look at the data.
read_csv("Datasets to use/dogs.csv") -> dogs
dogs
The dogs data are already in contingency-table format, so we do not need to reformat them for the test. However, to plot the data we need to restructure them so that we have proportions to plot.
dogs %>%
mutate(Total = No+Yes) %>%
pivot_longer(cols= 2:4, names_to = "Dance", values_to = "count" ) -> dogs.long
dogs.long %>%
group_by(Training) %>%
mutate(fraction = count/count[Dance =="Total"]) %>%
filter(Dance != "Total") %>%
ungroup() -> dogs.long
dogs.long %>%
ggplot(aes(x=Training, y=fraction, fill=Dance)) +
geom_col(colour="black")+
scale_fill_brewer(palette = 1)
We have a 2x2 table, so we will use Fisher's exact test.
dogs %>%
select(No,Yes) %>%
fisher_test(detailed = TRUE)
For dogs, there is no significant association between dancing and reward. This is in contrast to cats, who are more likely to dance having received food.
Is the proportion of cane toads infected by intestinal parasites the same in 3 different areas of Queensland?
Read in and plot the data - we again need to restructure it in order to do this.
cane.toad <- read_csv("Datasets to use/cane.toad.csv")
cane.toad %>%
pivot_longer(cols= 2:4, names_to = "area", values_to = "count" ) -> cane.toad.long
cane.toad.long %>%
group_by(area) %>%
mutate(fraction = count/count[Infection =="Total"]) %>%
filter(Infection != "Total") %>%
ungroup() -> cane.toad.long
cane.toad.long %>%
ggplot(aes(x=area, y=fraction, fill=Infection)) +
geom_col(colour="black")+
scale_fill_brewer(palette = 2)
First, look at the overall differences
cane.toad %>%
filter(Infection != "Total") %>%
select(-Infection) %>%
chisq_test()
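As a quick check of the chi-squared assumptions, base R's chisq.test() stores the expected counts, which should mostly be at least 5; a minimal sketch (chisq.res is just a working name):
cane.toad %>%
filter(Infection != "Total") %>%
select(-Infection) %>%
as.matrix() %>%
chisq.test() -> chisq.res
chisq.res$expected  # expected counts under independence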
There is a significant association between infection and area (p=0.00154), so let's do some pairwise comparisons to see which areas have significantly different proportions.
For this, if we convert our “Infection” column to rownames, then these will be used in the analysis output, making it easier to read and interpret.
cane.toad %>%
filter(Infection != "Total") %>%
column_to_rownames("Infection") %>%
pairwise_fisher_test(p.adjust.method = "holm")->toad.test
toad.test
The proportion of cane toads infected by intestinal parasites is significantly lower in Bowden than Mackay and Rockhampton (adjusted p=0.0037 and 0.045, respectively).
Let's put it on a graph.
cane.toad.long %>%
ggplot(aes(x=area, y=fraction)) +
geom_col(colour="black", aes(fill=Infection))+
scale_fill_brewer(palette = 2)+
scale_y_continuous(breaks=seq(from =0, by=0.2, to=1.25), limits = c(0,1.25))+
stat_pvalue_manual(toad.test, y.position = c(1.05, 1.13, 1.17), label = "p = {p.adj}")+
annotate("text", label="Fisher's exact tests with Holm correction", x=2, y=1.25, size=5)
Set up the packages we need.
library(tidyverse)
library(rstatix)
library(ggpubr)
library(ggsignif)
library(gglm)
theme_set(theme_bw(base_size = 14))
Does the mean petal length of iris flowers differ according to their species?
Start by reading in and plotting the data.
iris.species <- read_csv("Datasets to use/iris.species.csv")
iris.species
iris.species %>%
ggplot(aes(Species,Petal.Length, colour = Species)) +
geom_jitter(height=0, width=0.2, size = 2, show.legend = FALSE) +
stat_summary(geom="crossbar", fun=mean, width=0.6, linewidth=0.3, show.legend = FALSE)
iris.species %>%
ggplot(aes(x=Species, y=Petal.Length)) +
geom_boxplot()
iris.species %>%
ggplot(aes(x=Species, y=Petal.Length))+
geom_violin(trim=TRUE, linewidth=1, show.legend=FALSE)+
geom_jitter(height=0, width=0.2, size = 2, show.legend = FALSE) +
ylab("Length (cm)")
This looks very clear cut, so you hardly need to do stats at all! However, the data look like they might be deviating a little from normality, with the Virginica species looking slightly skewed, and the groups look to have potentially different variability.
Let's have a look at the formal tests.
ggqqplot(iris.species, x = "Petal.Length", facet.by = "Species") +
theme_bw()
model <- lm(Petal.Length ~ Species, data = iris.species)
ggqqplot(residuals(model)) +
theme_bw()
shapiro_test(residuals(model))
iris.species %>%
group_by(Species) %>%
shapiro_test(Petal.Length)
iris.species %>%
levene_test(Petal.Length ~ Species)
As suspected, there are points outside the confidence bands on the QQ plot, and the Virginica species fails the Shapiro-Wilk test. Levene's test is also significant, indicating a lack of homogeneity of variance.
Therefore, let's run the non-parametric Mann-Whitney test.
iris.species %>%
wilcox_test(Petal.Length~Species)
Answer: the Virginica species of iris has a significantly longer petal length than the Setosa species (p<0.0001).
Let's put the result on a final graph.
iris.species %>%
wilcox_test(Petal.Length~Species) %>%
add_xy_position() -> iris.test
iris.species %>%
ggplot(aes(x=Species, y=Petal.Length, colour=Species))+
geom_jitter(height=0, width=0.1, size=3, alpha = 0.8, show.legend = FALSE)+
scale_color_brewer(palette="Dark2")+
stat_summary(geom="crossbar", fun=median, width=0.6, linewidth=0.3, show.legend = FALSE) +
scale_y_continuous(breaks=seq(from=0, by=1, to=8), limits = c(0, 8))+
stat_pvalue_manual(iris.test, label="Mann-Whitney test, p = {p}", size=5)
You want to know whether the mean petal length of iris flowers correlates with the mean sepal length in the virginica species.
read_csv("Datasets to use/iris.sepal.csv") -> iris.sepal
iris.sepal
iris.sepal %>%
ggplot(aes(Sepal.Length, Petal.Length)) +
geom_point(size=3)
There looks to be a strong outlier, so let's fit a line of best fit and look at the diagnostic tests.
lm(Petal.Length ~ Sepal.Length, data=iris.sepal)-> fit.sepal
coefficients(fit.sepal) -> coef.sepal
summary(fit.sepal)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris.sepal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3240 -0.2041 0.1031 0.3069 0.6280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.8660 1.0940 -1.706 0.101
## Sepal.Length 1.0961 0.1645 6.662 6.85e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5956 on 24 degrees of freedom
## Multiple R-squared: 0.649, Adjusted R-squared: 0.6344
## F-statistic: 44.38 on 1 and 24 DF, p-value: 6.849e-07
iris.sepal$cook <- cooks.distance(fit.sepal)
iris.sepal$residual <- rstandard(fit.sepal)
iris.sepal
par(mfrow=c(2,2))
plot(fit.sepal)
The outlier looks like it is having an impact on our model. We can go back to the raw data (in iris.data) and see that this point actually comes from another iris species (setosa) and was incorrectly copied across, so we can definitely remove this.
Let's remove it and fit a new model.
iris.sepal %>%
filter(cook<1) -> iris.sepal.outlier
lm(Petal.Length ~ Sepal.Length, data=iris.sepal.outlier)-> fit.sepal.outlier
coefficients(fit.sepal.outlier) -> coef.sepal.outlier
summary(fit.sepal.outlier)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris.sepal.outlier)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.61228 -0.20779 -0.03223 0.19445 0.38323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49857 0.54875 0.909 0.373
## Sepal.Length 0.75561 0.08185 9.231 3.38e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2681 on 23 degrees of freedom
## Multiple R-squared: 0.7875, Adjusted R-squared: 0.7782
## F-statistic: 85.22 on 1 and 23 DF, p-value: 3.381e-09
par(mfrow=c(2,2))
plot(fit.sepal.outlier)
The fit looks a lot more reasonable now - let's run a Pearson's correlation test and plot this on a graph.
iris.sepal.outlier %>%
cor_test(Sepal.Length, Petal.Length)
iris.sepal.outlier %>%
ggplot(aes(Sepal.Length, Petal.Length)) +
geom_point(size=3) +
geom_abline(intercept = coef.sepal.outlier[1], slope = coef.sepal.outlier[2])+
annotate("text", label = "r = 0.89, p = 3.38e-09, r2 = 78.75%", x=6.3, y=6.4, size=6)
There is a significant positive relationship between petal length and sepal length (r=0.89, r2=0.7875, p=3.38e-09).
Which intervention should the city use to maximise household recycling?
Read in and have a look at the data.
read_csv("Datasets to use/recycling.csv") -> recycling
recycling
Let's calculate proportions, so the data can be plotted on a graph.
recycling %>%
mutate(Total = Recycle+Not.recycle) %>%
pivot_longer(cols= 2:4, names_to = "Recycles", values_to = "count" ) -> recycling.long
recycling.long %>%
group_by(Intervention) %>%
mutate(fraction = count/count[Recycles =="Total"]) %>%
filter(Recycles != "Total") %>%
ungroup() -> recycling.long
recycling.long %>%
ggplot(aes(x=Intervention, y=fraction, fill=Recycles)) +
geom_col(colour="black")+
scale_fill_brewer(palette = 4)
Do a Chi-squared test, and pairwise comparisons between the interventions.
recycling %>%
select(Recycle,Not.recycle) %>%
chisq_test()
recycling %>%
column_to_rownames("Intervention") %>%
pairwise_fisher_test(p.adjust.method = "holm")->recycling.test
recycling.test
There is a significant difference between our interventions and recycling uptake (p=0.0075). Both interventions are significantly better than the control group (adjusted p=0.036 for both), but there is no significant difference between the use of a flyer or a phone call. Therefore, the choice of intervention may come down to other considerations, including cost, resources, and time.
Let's put it on a graph.
recycling.long %>%
ggplot(aes(x=Intervention, y=fraction)) +
geom_col(colour="black", aes(fill=Recycles))+
scale_fill_brewer(palette = 4)+
stat_pvalue_manual(recycling.test, y.position = c(1.05, 1.10, 1.15), label = "p = {p.adj}")+
annotate("text", label="Fisher's exact tests with Holm correction", x=2, y=1.25, size=5)
Generate a linear model describing the relationship between income and happiness. How happy would you expect someone earning 50k to be, based on your model?
read_csv("Datasets to use/income.data.csv") -> income
income
income %>%
ggplot(aes(income, happiness)) +
geom_point(size=2)
Let's fit a line of best fit and look at the diagnostic tests.
lm(happiness ~ income, data=income)-> fit.income
coefficients(fit.income) -> coef.income
summary(fit.income)
##
## Call:
## lm(formula = happiness ~ income, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.02479 -0.48526 0.04078 0.45898 2.37805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.20427 0.08884 2.299 0.0219 *
## income 0.71383 0.01854 38.505 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7181 on 496 degrees of freedom
## Multiple R-squared: 0.7493, Adjusted R-squared: 0.7488
## F-statistic: 1483 on 1 and 496 DF, p-value: < 2.2e-16
income$cook <- cooks.distance(fit.income)
income$residual <- rstandard(fit.income)
income
par(mfrow=c(2,2))
plot(fit.income)
There do not seem to be any outliers or other issues with our data, so let's plot them.
income %>%
ggplot(aes(income, happiness)) +
geom_point(size=2) +
geom_abline(intercept = coef.income[1], slope = coef.income[2],
colour = "blue", linewidth = 1)+
xlab("Income (x10^-4)")
We now want to calculate how happy someone earning 50k would be, based on the model. We can get the information on the line of best fit from the model summary.
summary(fit.income)
##
## Call:
## lm(formula = happiness ~ income, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.02479 -0.48526 0.04078 0.45898 2.37805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.20427 0.08884 2.299 0.0219 *
## income 0.71383 0.01854 38.505 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7181 on 496 degrees of freedom
## Multiple R-squared: 0.7493, Adjusted R-squared: 0.7488
## F-statistic: 1483 on 1 and 496 DF, p-value: < 2.2e-16
Using the model intercept and slope (income coefficient), we can predict the happiness score we would expect for someone earning £50k
happy.50 <- 0.204 + 0.714*5
happy.50
## [1] 3.774
Answer: From the linear regression happiness = 0.204 + 0.714 x income(x10^-4) [y-int + slope x income] so if earning 50k, happiness=0.204+0.714*5=3.774.
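Equivalently, base R's predict() does this calculation from the fitted model using the full-precision coefficients; a quick check:
predict(fit.income, newdata = data.frame(income = 5))  # ~3.77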
Let's also add the line of best fit formula to the graph. We could also show our predicted point, if we wanted.
income %>%
ggplot(aes(income, happiness)) +
geom_point(size=2) +
geom_abline(intercept = coef.income[1], slope = coef.income[2],
colour = "blue", linewidth = 1)+
xlab("Income (x10^-4)") +
annotate("text", label = "Happiness=0.204+0.714*income(x10^-4)", x=3.5, y=6.5, size=6)
income %>%
ggplot(aes(income, happiness)) +
geom_point(size=2) +
geom_abline(intercept = coef.income[1], slope = coef.income[2],
colour = "blue", linewidth = 1)+
annotate("point", x=5, y=happy.50, colour = "red", size = 4)+
xlab("Income (x10^-4)") +
annotate("text", label = "Happiness=0.204+0.714*income(x10^-4)", x=3.5, y=6.5, size=6)
Which beans should you buy?
coffee <- read_csv("Datasets to use/coffee.country.csv")
coffee
coffee %>%
ggplot(aes(x=Country, y=Score, colour=Country))+
geom_jitter(height=0, width=0.3, size=3)+
stat_summary(geom="crossbar", fun=mean, width=0.6, linewidth=0.3, show.legend = FALSE)
coffee %>%
ggplot(aes(x=Country, y=Score, colour=Country))+
geom_boxplot(show.legend = FALSE)
coffee %>%
ggplot(aes(x=Country, y=Score, colour=Country))+
geom_violin(show.legend = FALSE)+
geom_jitter(height=0, width=0.25, alpha=0.5, size=3, show.legend = FALSE)
It looks like there might be a couple of outliers, particularly the low point in the Costa Rica group - let's have a look at the QQ plot and normality tests.
model <- aov(Score ~ Country, data = coffee)
ggqqplot(residuals(model)) +
theme_bw()
shapiro_test(residuals(model))
coffee %>%
group_by(Country) %>%
shapiro_test(Score)
coffee %>%
levene_test(Score ~ Country)
We can also test for outliers
coffee %>%
group_by(Country) %>%
identify_outliers(Score)
This dataset again raises the question of whether or not to remove outliers. In fact, only one extreme outlier is identified, and it seems likely that the outliers are real results, as they are not implausible. Therefore, doing the analysis with the outliers included is likely preferable, unless there is a reason to exclude them or a decision was made to exclude any identified outliers prior to data analysis.
Including the potential outliers, the data fail normality tests and there are multiple points outside the confidence bands on the QQ plot. Also, given that the measure is a scoring system, where non-parametric tests are generally used, we will use a Kruskal-Wallis test.
coffee %>%
kruskal_test(Score ~ Country)
coffee %>%
dunn_test(Score ~ Country)
Answer: Overall, there is an association between the country the beans are from and their score (H=23.65, p=2.95e-05). Coffee beans from Ethiopia receive significantly higher scores than those from China (p=0.0164), Costa Rica (p=8.64e-05), and Indonesia (p=3.55e-04).
Finally, let's put our results on a plot.
coffee %>%
dunn_test(Score ~ Country) %>%
filter(p.adj<0.05) -> coffee.sig
coffee %>%
ggplot(aes(Country, Score, colour=Country)) +
geom_jitter(height=0, width=0.1, alpha=0.5, size=4, show.legend = FALSE)+
stat_summary(geom="crossbar", fun=median, width=0.6, linewidth=0.3, show.legend = FALSE) +
geom_signif(comparisons = list(c("China","Ethiopia"),
c("Costa Rica","Ethiopia"),
c("Ethiopia","Indonesia")),
annotations = signif(coffee.sig$p.adj, digits=3),
y_position = c(91, 92.5,93.5),
colour = "black",
show.legend = FALSE)
coffee %>%
ggplot(aes(Country, Score, colour=Country)) +
geom_jitter(height=0, width=0.1, alpha=0.5, size=4, show.legend = FALSE)+
stat_summary(geom="crossbar", fun=median, width=0.6, linewidth=0.3, show.legend = FALSE) +
stat_pwc(method = "dunn_test", label = "p.adj.signif", hide.ns = TRUE, show.legend = FALSE,
)